[HN Gopher] AdapTive-LeArning Speculator System (ATLAS): Faster ...
___________________________________________________________________
AdapTive-LeArning Speculator System (ATLAS): Faster LLM inference
Author : alecco
Score : 185 points
Date : 2025-10-12 08:37 UTC (14 hours ago)
(HTM) web link (www.together.ai)
(TXT) w3m dump (www.together.ai)
| petesergeant wrote:
| > Built on top of Together Turbo Speculator, ATLAS reaches up to
| 500 TPS on DeepSeek-V3.1 and up to 460 TPS on Kimi-K2 in a fully
| adapted scenario -- 2.65x faster than standard decoding,
| outperforming even specialized hardware like Groq
|
| and yet, if you click on:
| https://openrouter.ai/moonshotai/kimi-k2-0905
|
| You'll see Groq averaging 1,086tps vs Together doing 59tps. Groq
| and Cerebras often feel like the only games in town. I'd love
| that to be different (because I'd like more models!), but nobody
| else is coming close right now.
|
| Comparing how quickly gpt-oss-120b runs gives a broader picture:
| https://openrouter.ai/openai/gpt-oss-120b -- Vertex (Google) and
| SambaNova do pretty good on it too, but still, the difference
| between a top provider and an also-ran is giant.
|
| God I love OpenRouter.
| senko wrote:
| > You'll see Groq averaging 1,086tps
|
| What I don't understand is Groq themselves reporting 200tps
| for the same model:
| https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...
|
| OpenRouter numbers look fishy.
| p1esk wrote:
| Do these numbers compare performance at the same cost?
| petesergeant wrote:
| You can see the cost in the links, and the answer is "pretty
| much" for the consumer. The backend maths, no idea.
| Havoc wrote:
| >Groq and Cerebras often feel like the only games in town.
|
| SambaNova should be similar...they've got a similar specialized
| hardware approach
| jbellis wrote:
| groq is quantizing, even though it's not labeled as such on
| openrouter (super frustrating)
| bn-l wrote:
| Do you have a source for that? They are pretty close to the
| ref implementation on moonshot's ranking
| immortal3 wrote:
| There's another angle to this comparison. Groq and Cerebras use
| custom chips, but I'm not sure about Together. In this case,
| Together is sharing results based on the B200 GPU. Another
| important point is the accuracy of these speed-ups compared to
| the baseline model. It's known that such tricks reduce
| accuracy, but by how much? Kimi has already benchmarked several
| providers.
| https://x.com/Kimi_Moonshot/status/1976926483319763130
| jsheard wrote:
| > Groq and Cerebras use custom chips
|
| Not just custom chips, but custom chips which derive much of
| their performance from enormous amounts of SRAM. There's no
| denying that approach is fast, but it's also incredibly
| expensive, and SRAM scaling has slowed to a crawl so it won't
| get much cheaper any time soon.
| petesergeant wrote:
| This is an "expensive for whom" question. I'd be keen to
| know if they're burning investor money hosting these right
| now or if they're able to run these at cost.
| rfoo wrote:
| > It's known that such tricks reduce accuracy
|
| AFAIU, speculative decoding (and this fancier version of
| spec. decoding) does not reduce accuracy.
| martinald wrote:
| No, it shouldn't. "All" you're doing is having a small model
| draft the next few tokens and then having the large model
| "verify" them. Where the large model diverges from the small
| one, you discard the rest of the draft and continue from the
| large model's own token.
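|
| A minimal sketch of that loop in Python (toy stand-ins for the
| draft and target models, greedy decoding, exact-match
| acceptance; not Together's actual implementation, and the
| ATLAS twist of adapting the speculator on live traffic is not
| shown):
|
|     from typing import Callable, List
|
|     # A "model" here is just: prefix of token ids -> next id.
|     Model = Callable[[List[int]], int]
|
|     def speculative_decode(target: Model, draft: Model,
|                            prompt: List[int], k: int = 4,
|                            max_new: int = 32) -> List[int]:
|         out = list(prompt)
|         while len(out) < len(prompt) + max_new:
|             # 1. Cheap draft model proposes k tokens, one by one.
|             proposal = []
|             for _ in range(k):
|                 proposal.append(draft(out + proposal))
|             # 2. Expensive target model checks them. In a real
|             #    system this is ONE batched forward pass, not a
|             #    Python loop.
|             accepted = []
|             for i in range(k):
|                 expected = target(out + proposal[:i])
|                 if expected != proposal[i]:
|                     accepted.append(expected)  # target's pick
|                     break
|                 accepted.append(proposal[i])
|             out += accepted
|         return out[:len(prompt) + max_new]
|
|     # Toy models: the draft agrees with the target most of the
|     # time, so several tokens get accepted per verification.
|     target = lambda s: (s[-1] + 1) % 100
|     draft = lambda s: 0 if s[-1] % 7 == 0 else (s[-1] + 1) % 100
|     print(speculative_decode(target, draft, [1], k=4, max_new=12))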
| Der_Einzige wrote:
| It's quantization which is crippling accuracy...
| meander_water wrote:
| Interesting, if you take a look at the median throughput chart
| [0], groq goes insane after 7th Oct. Wonder what happened.
|
| [0] https://openrouter.ai/moonshotai/kimi-k2-0905/performance
| awestroke wrote:
| Heavy quantization
| petesergeant wrote:
| They claim (or someone on Reddit who claims to be staff
| claims) that's not accurate: https://www.reddit.com/r/Local
| LLaMA/comments/1mk4kt0/comment...
| sigmar wrote:
| 2x jump overnight. new LPU hardware? I checked the speed for
| groq's gpt-oss-120B, Llama4-maverick, and Llama4-scout; none
| of them had a noticeable change this month
| KronisLV wrote:
| > I'd love that to be different (because I'd like more
| models!), but nobody else is coming close right now.
|
| I'm currently on the Cerebras Code subscription for like 50 USD
| a month because it more or less makes the rate limits I used to
| deal with on other platforms disappear (without making me spend
| upwards of 100 USD paying per token):
| https://www.cerebras.ai/blog/introducing-cerebras-code
|
| At the same time, their Qwen Coder 480B model is _fine_ but I
| still find myself going for Claude or GPT-5 or Gemini 2.5 Pro
| for more complex issues (or ones where I need good use of the
| Latvian language), at least for programming tasks. It'd
| eventually be super cool if they could offer more models.
|
| Or have some sort of a partnership with Anthropic or whoever,
| because getting my questions answered at around 500-1500 TPS is
| really, really pleasant, especially for agentic use cases with
| code modifications, even if I still bump into the 128k context
| limits occasionally.
| alecco wrote:
| But Groq/Cerebras are hardware accelerators. It's an unrelated
| optimization. I wouldn't be surprised if they could also use
| speculators (today or in the future).
| ashvardanian wrote:
| Will need some time to go through the details, but it's
| increasingly rare to see teams consistently delivering meaningful
| improvements in the open. Impressive work!
| wishawa wrote:
| Inference is impressively fast. But what about quality? In the
| Kimi vendor verifier (https://github.com/MoonshotAI/K2-Vendor-
| Verifier/), Together has one of the highest tool call failure
| rates (>300 failures over the benchmark, compared to 0-2 for the
| official API, groq, SiliconFlow, and Infinigence).
| rfoo wrote:
| If you compare "schema validation error count" plus "Count of
| Finish Reason others", then SiliconFlow and Infinigence are in
| the same bucket too. Maybe their API layer detected incorrect
| tool calls and set the finish reason to something else?
|
| IMO this likely is what you get from running the model
| correctly as-is (i.e. using the same weight and activation
| dtype), so Together is not bad.
|
| Moonshot AI themselves and Groq likely use some sampler tricks
| to eliminate schema validation errors.
|
| So really the only thing this shows is: Nebius, Chutes,
| AtlasCloud could be running something else (for example further
| quantized model). Or bugs.
| wishawa wrote:
| Fair point. If Moonshot is holding back the true weights or
| inference techniques that affect correctness, then providers
| including Together should call them out on that. I for one
| would stop using Kimi if that is the case.
|
| Anyway, Novita is doing significantly better on the vendor
| verifier chart than Together, so the low quality must be at
| least partially Together's fault.
| rfoo wrote:
| I don't think it's the weights being different or special
| inference techniques; more likely they haven't been able to
| train the model to follow the tool schema perfectly yet, and
| both Moonshot and Groq decided to use something like
| https://github.com/noamgat/lm-format-enforcer to make sure
| at least the output format is correct.
| sailingparrot wrote:
| I don't know anything about Together quality in general, but
| the specific technique discussed here (speculative decoding)
| has no impact on the quality of generations. So you should be
| able to apply it to whichever model you want, and see the
| advertised speedup while retaining the quality of your base
| model.
| furyofantares wrote:
| > the specific technique discussed here (speculative
| decoding) has no impact on the quality of generations
|
| I don't see why that would be true. As I understand, the
| verifier is checking if the tokens are good-enough, not if
| they're the exact same tokens it would have selected. The
| predicted tokens could be consistently slightly worse, which
| could have a cascading effect to make the overall output a
| lot worse.
| sailingparrot wrote:
| > the verifier is checking if the tokens are good-enough,
| not if they're the exact same tokens it would have selected
|
| That's up to you: it depends on how you implement it and how
| much you want to prioritize speed at the expense of quality;
| it is not an intrinsic attribute of speculative decoding. The
| verifier checks whether the tokens predicted by the draft
| model are among the top-k tokens predicted by the full-size
| model at each step. Set k to 1 and you will only accept exact
| matches. Set k > 1 and you will indeed start selecting "good
| enough" tokens, but will get faster inference.
|
| But no matter what value you choose for k, the technique
| described in the article still applies and will give faster
| inference at no loss compared to a setup without it, at the
| same value of k.
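|
| As a sketch, that acceptance rule is just a top-k membership
| test on the target model's logits (the numbers below are
| made-up placeholders, and real deployments usually tie this
| to the sampler's own top-k/top-p settings):
|
|     import numpy as np
|
|     def accept(draft_token: int, target_logits: np.ndarray,
|                k: int = 1) -> bool:
|         # Accept the drafted token iff it is among the target
|         # model's k highest-scoring tokens at this position.
|         # k=1 is exact (greedy) matching; k>1 is the relaxed,
|         # faster-but-lossier variant discussed above.
|         top_k = np.argsort(target_logits)[-k:]
|         return draft_token in top_k
|
|     logits = np.array([0.1, 2.0, 0.3, 1.5, 0.0, 0.2])
|     print(accept(1, logits, k=1))  # True: token 1 is the argmax
|     print(accept(3, logits, k=1))  # False under exact matching
|     print(accept(3, logits, k=2))  # True: token 3 in the top-2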
| buildbot wrote:
| It can be exact or not! Depends on the kind of sampling you
| are doing.
|
| You can do exact verification, and as soon as a token
| mismatches you reject everything after that token from your
| draft. Relaxed acceptance techniques measure how wrong that
| mispredicted token is via some metric, and accept it if
| it's close enough. So you get longer draft lengths with
| higher acceptance rates.
| gkapur wrote:
| Adding to the prior comments as my intuition matched yours,
| there's a nice Reddit thread that gives some context into
| how it can be faster even if you require exact matches:
| https://www.reddit.com/r/LocalLLaMA/s/ARxHLqRjdM
|
| The TLDR/key (from my understanding) is that verifying N
| tokens can be faster than generating N tokens.
| sailingparrot wrote:
| > The TLDR/key (from my understanding) is that verifying
| N tokens can be faster than generating N tokens.
|
| Yes. This is because to generate token n+1 you need token n,
| etc., so generating from scratch is a sequential (and thus
| slow) process. When we verify tokens, we can, for each token,
| use all preceding tokens as input and check that the output
| token matches the expectation. Since the full sequence we
| want to verify already exists, we can do this for every
| token in parallel rather than sequentially.
|
| This is why training transformer models is much faster than
| training RNNs: we do the same thing during training, it's
| just that the sequence we compare against is the ground
| truth rather than the output of another model.
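|
| Concretely: one forward pass over (context + draft) gives the
| target's next-token prediction at every position, so checking
| the whole draft is just indexing (numpy placeholders below;
| real logits would come from the target model):
|
|     import numpy as np
|
|     def n_accepted(target_logits: np.ndarray,
|                    n_ctx: int, draft: list) -> int:
|         # target_logits: [seq_len, vocab] from ONE forward
|         # pass over context + drafted tokens. Logits at
|         # position p predict token p+1, so every drafted
|         # token is checked against the target's greedy pick.
|         preds = target_logits.argmax(axis=-1)
|         for i, tok in enumerate(draft):
|             if preds[n_ctx - 1 + i] != tok:
|                 return i       # first mismatch: stop here
|         return len(draft)
|
|     # 4 context tokens + 2 drafted tokens, vocab size 5.
|     logits = np.random.default_rng(0).normal(size=(6, 5))
|     draft = [int(logits[3].argmax()), int(logits[4].argmax())]
|     print(n_accepted(logits, n_ctx=4, draft=draft))  # -> 2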
| wishawa wrote:
| I didn't know this! I've always thought speculative decoding
| was "if p(draft_token) > threshold, use it". You made me go
| read how it actually works and it's pretty neat!
|
| That said, I still think some providers are cheating. Please
| correct me if the test below is flawed.
|
| I generated texts at temperature = 0 vs temperature = 2. At
| high temperature, the distributions effectively become
| flatter, meaning the difference between real and draft
| effective distributions (the D_LK used in theorem 3.5 of
| 2211.17192) becomes smaller. When T=2, the model speaks
| complete gibberish, so the effective distribution must be
| pretty flat. This should mean fewer rejections --> a lot
| faster speculative decoding. Yet, I see no increase in
| throughput at all...
| sailingparrot wrote:
| Not sure exactly what setup you are running, but in theory
| yes: a higher temperature for both models means a higher
| chance of overlap and thus fewer rejections -> faster
| sampling (but worse quality overall).
|
| However, if you have a higher temperature but are still
| operating under top-k sampling where k is small, I'm not
| sure it's going to translate into any noticeable
| difference, since that keeps your actual distributions
| very much non-uniform.
| wishawa wrote:
| This is with Together's API via OpenRouter, running
| DeepSeek V3 0324 and Kimi K2 0905.
|
| I didn't set a top-k. So it seems like Together must be
| doing something weird in their speculative decoding
| implementation.
| sailingparrot wrote:
| Oh, in that case there is definitely a top-k or top-p
| behind the scenes; it might just not be exposed to the
| user as a param they can change through the API. I haven't
| heard of anyone running an LLM in prod with actual pure
| sampling.
| Havoc wrote:
| >a faster speculator (also known as the draft model) proposes
| multiple tokens ahead, and the target model verifies them in
| parallel in a single forward pass
|
| TIL. Bit of an aha moment - never understood till now how a big
| model can verify faster than it can generate
| woadwarrior01 wrote:
| As with almost everything else in CS, it's a tradeoff.
| Pre-fill is compute bound; decoding is memory bandwidth
| bound. Speculative decoding works when the draft model is
| more often right than wrong, because most architectures
| have a lot more compute than memory bandwidth.
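|
| A rough back-of-envelope for why that matters (all numbers
| below are assumed round figures, just to show the shape of
| the argument, not measured values):
|
|     # Single-stream decode has to stream the (active) weights
|     # from HBM for every token, so it is bandwidth limited.
|     bandwidth_gb_s = 8000   # assumed ~8 TB/s (B200-class HBM)
|     active_weights_gb = 37  # assumed FP8, ~37B active params
|
|     step_s = active_weights_gb / bandwidth_gb_s
|     print(f"~{1 / step_s:.0f} tok/s ceiling, 1 token per step")
|
|     # Verifying a k-token draft reuses that same weight read,
|     # so if about `a` of the k drafted tokens are accepted:
|     k, a = 4, 3
|     print(f"~{a / step_s:.0f} tok/s if {a} of {k} accepted")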
| andblac wrote:
| At first glance, this reminds me of how branch prediction is
| used in CPUs to speed up execution. As I understand it, this
| development is like a form of soft branch prediction over
| language trajectories: a small model predicts what the main
| model will do, runs a few steps ahead, and then the results
| are verified (and this can be done in parallel). If it checks
| out, you just jump forward; if not, you take a miss, but
| that's rare. I find it funny how small-big ideas like this
| come up in different contexts again and again in the history
| of our technological development. Of course ideas, as always,
| are cheap. The hard part is how to actually use them and cash
| in on them.
| red2awn wrote:
| A lot of optimizations in LLMs now are low-hanging fruit
| inspired by techniques from classical computer science.
| Another one that comes to mind is paged KV caching, which is
| based on memory paging.
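|
| A toy version of that analogy (block size and bookkeeping are
| illustrative, vLLM-style, not any provider's actual code):
|
|     BLOCK = 16  # tokens per KV block (assumed page size)
|
|     class PagedKV:
|         # Per-sequence "block tables" map logical token
|         # positions to fixed-size physical cache blocks,
|         # the way a page table maps pages to frames.
|         def __init__(self, n_blocks: int):
|             self.free = list(range(n_blocks))  # block pool
|             self.tables = {}                   # seq -> blocks
|
|         def slot(self, seq: int, pos: int):
|             table = self.tables.setdefault(seq, [])
|             if pos // BLOCK == len(table):  # need new block
|                 table.append(self.free.pop())
|             return table[pos // BLOCK], pos % BLOCK
|
|     kv = PagedKV(n_blocks=8)
|     print([kv.slot(0, p) for p in range(18)][-3:])
|     # -> [(7, 15), (6, 0), (6, 1)]: the 17th token spills
|     # into a second physical block.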
| LogicFailsMe wrote:
| No barrier to entry whatsoever? Backprop on the speculative
| decoding weights during inference to improve their accuracy on a
| per application basis?
|
| Cool hack though, kudos. Wonder if they can make Groq or Cerebras
| do the same thing?
| necovek wrote:
| So with a 4x speed-up, Together will give us at least 2x lower
| price for top-end models, right? :)
| jsutton97 wrote:
| I can't help but wonder how much longer we'll see this work
| shared openly.
| diamond559 wrote:
| Great, my slop memes can come out much faster now. This is the
| future of the world economy!
| hazrmard wrote:
| Do I understand this right?
|
| A light-weight speculative model adapts to usage, keeping the
| acceptance rate for the static heavy-weight model within
| acceptable bounds.
|
| Do they adapt with LoRAs?
___________________________________________________________________
(page generated 2025-10-12 23:00 UTC)