[HN Gopher] AdapTive-LeArning Speculator System (ATLAS): Faster ...
___________________________________________________________________
AdapTive-LeArning Speculator System (ATLAS): Faster LLM inference
Author : alecco
Score : 185 points
Date : 2025-10-12 08:37 UTC (14 hours ago)
(HTM) web link (www.together.ai)
(TXT) w3m dump (www.together.ai)
| petesergeant wrote:
| > Built on top of Together Turbo Speculator, ATLAS reaches up to
| 500 TPS on DeepSeek-V3.1 and up to 460 TPS on Kimi-K2 in a fully
| adapted scenario -- 2.65x faster than standard decoding,
| outperforming even specialized hardware like Groq
|
| and yet, if you click on:
| https://openrouter.ai/moonshotai/kimi-k2-0905
|
| You'll see Groq averaging 1,086tps vs Together doing 59tps. Groq
| and Cerebras often feel like the only games in town. I'd love
| that to be different (because I'd like more models!), but nobody
| else is coming close right now.
|
| Comparing how quickly gpt-oss-120b runs gives a broader picture:
| https://openrouter.ai/openai/gpt-oss-120b -- Vertex (Google) and
| SambaNova do pretty good on it too, but still, the difference
| between a top provider and an also-ran is giant.
|
| God I love OpenRouter.
| senko wrote:
| > You'll see Groq averaging 1,086tps
|
| What I don't understand is Groq themselves reporting 200tps
| for the same model:
| https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...
|
| OpenRouter numbers look fishy.
| p1esk wrote:
| Do these numbers compare performance at the same cost?
| petesergeant wrote:
| You can see the cost in the links, and the answer is "pretty
| much" for the consumer. The backend maths, no idea.
| Havoc wrote:
| >Groq and Cerebras often feel like the only games in town.
|
| SambaNova should be similar...they've got a similar specialized
| hardware approach
| jbellis wrote:
| groq is quantizing, even though it's not labeled as such on
| openrouter (super frustrating)
| bn-l wrote:
| Do you have a source for that? They are pretty close to the
| ref implementation on moonshot's ranking
| immortal3 wrote:
| There's another angle to this comparison. Groq and Cerebras use
| custom chips, but I'm not sure about Together. In this case,
| Together is sharing results based on the B200 GPU. Another
| important point is the accuracy of these speed-ups compared to
| the baseline model. It's known that such tricks reduce
| accuracy, but by how much? Kimi has already benchmarked several
| providers.
| https://x.com/Kimi_Moonshot/status/1976926483319763130
| jsheard wrote:
| > Groq and Cerebras use custom chips
|
| Not just custom chips, but custom chips which derive much of
| their performance from enormous amounts of SRAM. There's no
| denying that approach is fast, but it's also incredibly
| expensive, and SRAM scaling has slowed to a crawl so it won't
| get much cheaper any time soon.
| petesergeant wrote:
| This is an "expensive for whom" question. I'd be keen to
| know if they're burning investor money hosting these right
| now or if they're able to run these at cost.
| rfoo wrote:
| > It's known that such tricks reduce accuracy
|
| AFAIU, speculative decoding (and this fancier version of
| spec. decoding) does not reduce accuracy.
| martinald wrote:
| No, it shouldn't. "All" you're doing is having a small model
| draft the next few tokens and then having the large model
| "verify" them. Where the large model diverges from the small
| one, you discard the rest of the draft and continue from the
| large model's own token.
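|
| A minimal sketch of that loop in Python (toy stand-ins for the
| draft and target models, greedy decoding, exact-match
| acceptance; not Together's actual implementation, and the
| ATLAS twist of adapting the speculator on live traffic is not
| shown):
|
|     from typing import Callable, List
|
|     # A "model" here is just: prefix of token ids -> next id.
|     Model = Callable[[List[int]], int]
|
|     def speculative_decode(target: Model, draft: Model,
|                            prompt: List[int], k: int = 4,
|                            max_new: int = 32) -> List[int]:
|         out = list(prompt)
|         while len(out) < len(prompt) + max_new:
|             # 1. Cheap draft model proposes k tokens, one by one.
|             proposal = []
|             for _ in range(k):
|                 proposal.append(draft(out + proposal))
|             # 2. Expensive target model checks them. In a real
|             #    system this is ONE batched forward pass, not a
|             #    Python loop.
|             accepted = []
|             for i in range(k):
|                 expected = target(out + proposal[:i])
|                 if expected != proposal[i]:
|                     accepted.append(expected)  # target's pick
|                     break
|                 accepted.append(proposal[i])
|             out += accepted
|         return out[:len(prompt) + max_new]
|
|     # Toy models: the draft agrees with the target most of the
|     # time, so several tokens get accepted per verification.
|     target = lambda s: (s[-1] + 1) % 100
|     draft = lambda s: 0 if s[-1] % 7 == 0 else (s[-1] + 1) % 100
|     print(speculative_decode(target, draft, [1], k=4, max_new=12))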
| Der_Einzige wrote:
| It's quantization which is crippling accuracy...
| meander_water wrote:
| Interesting, if you take a look at the median throughput chart
| [0], groq goes insane after 7th Oct. Wonder what happened.
|
| [0] https://openrouter.ai/moonshotai/kimi-k2-0905/performance
| awestroke wrote:
| Heavy quantization
| petesergeant wrote:
| They claim (or someone on Reddit who claims to be staff
| claims) that's not accurate: https://www.reddit.com/r/Local
| LLaMA/comments/1mk4kt0/comment...
| sigmar wrote:
| 2x jump overnight. new LPU hardware? I checked the speed for
| groq's gpt-oss-120B, Llama4-maverick, and Llama4-scout; none
| of them had a noticeable change this month
| KronisLV wrote:
| > I'd love that to be different (because I'd like more
| models!), but nobody else is coming close right now.
|
| I'm currently on the Cerebras Code subscription for like 50 USD
| a month because it more or less makes the rate limits I used to
| deal with on other platforms disappear (without making me spend
| upwards of 100 USD paying per token):
| https://www.cerebras.ai/blog/introducing-cerebras-code
|
| At the same time, their Qwen Coder 480B model is _fine_ but I
| still find myself going for Claude or GPT-5 or Gemini 2.5 Pro
| for more complex issues (or ones where I need good use of the
| Latvian language), at least for programming tasks. It'd
| eventually be super cool if they could offer more models.
|
| Or have some sort of a partnership with Anthropic or whoever,
| because getting my questions answered at around 500-1500 TPS is
| really, really pleasant, especially for agentic use cases with
| code modifications, even if I still bump into the 128k context
| limits occasionally.
| alecco wrote:
| But Groq/Cerebras are hardware accelerators. It's an unrelated
| optimization. I wouldn't be surprised if they could also use
| speculators (today or in the future).
| ashvardanian wrote:
| Will need some time to go through the details, but it's
| increasingly rare to see teams consistently delivering meaningful
| improvements in the open. Impressive work!
| wishawa wrote:
| Inference is impressively fast. But what about quality? In the
| Kimi vendor verifier (https://github.com/MoonshotAI/K2-Vendor-
| Verifier/), Together has one of the highest tool call failure
| rates (>300 failures over the benchmark, compared to 0-2 for the
| official API, groq, SiliconFlow, and Infinigence).
| rfoo wrote:
| If you compare "schema validation error count" plus "Count of
| Finish Reason others", then SiliconFlow and Infinigence are in
| the same bucket too. Maybe their API layer detected incorrect
| tool calls and set the finish reason to something else?
|
| IMO this likely is what you get from running the model
| correctly as-is (i.e. using the same weight and activation
| dtype), so Together is not bad.
|
| Moonshot AI themselves and Groq likely use some sampler tricks
| to eliminate schema validation errors.
|
| So really the only thing this shows is: Nebius, Chutes,
| AtlasCloud could be running something else (for example further
| quantized model). Or bugs.
| wishawa wrote:
| Fair point. If Moonshot is holding back the true weights or
| inference techniques that affect correctness, then providers
| including Together should call them out on that. I for one
| would stop using Kimi if that is the case.
|
| Anyway, Novita is doing significantly better on the vendor
| verifier chart than Together, so the low quality must be at
| least partially Together's fault.
| rfoo wrote:
| I don't think it's the weights being different or special
| inference techniques; more likely they haven't been able to
| train the model to follow the tool schema perfectly yet, and
| both Moonshot and Groq decided to use something like
| https://github.com/noamgat/lm-format-enforcer to make sure
| at least the output format is correct.
| sailingparrot wrote:
| I don't know anything about Together quality in general, but
| the specific technique discussed here (speculative decoding)
| has no impact on the quality of generations. So you should be
| able to apply it to whichever model you want, and see the
| advertised speedup while retaining the quality of your base
| model.
| furyofantares wrote:
| > the specific technique discussed here (speculative
| decoding) has no impact on the quality of generations
|
| I don't see why that would be true. As I understand, the
| verifier is checking if the tokens are good-enough, not if
| they're the exact same tokens it would have selected. The
| predicted tokens could be consistently slightly worse, which
| could have a cascading effect to make the overall output a
| lot worse.
| sailingparrot wrote:
| > the verifier is checking if the tokens are good-enough,
| not if they're the exact same tokens it would have selected
|
| That's up to you: it depends on how you implement it and how
| much you want to prioritize speed at the expense of quality;
| it is not an intrinsic attribute of speculative decoding. The
| verifier checks whether the tokens predicted by the draft
| model are among the top-k tokens predicted by the full-size
| model at each step. Set k to 1 and you will only accept exact
| matches. Set k > 1 and you will indeed start selecting "good
| enough" tokens, but will get faster inference.
|
| But no matter what value you choose for k, the technique
| described in the article still applies and will give faster
| inference at no loss compared to a setup without it, at the
| same value of k.
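|
| As a sketch, that acceptance rule is just a top-k membership
| test on the target model's logits (the numbers below are
| made-up placeholders, and real deployments usually tie this
| to the sampler's own top-k/top-p settings):
|
|     import numpy as np
|
|     def accept(draft_token: int, target_logits: np.ndarray,
|                k: int = 1) -> bool:
|         # Accept the drafted token iff it is among the target
|         # model's k highest-scoring tokens at this position.
|         # k=1 is exact (greedy) matching; k>1 is the relaxed,
|         # faster-but-lossier variant discussed above.
|         top_k = np.argsort(target_logits)[-k:]
|         return draft_token in top_k
|
|     logits = np.array([0.1, 2.0, 0.3, 1.5, 0.0, 0.2])
|     print(accept(1, logits, k=1))  # True: token 1 is the argmax
|     print(accept(3, logits, k=1))  # False under exact matching
|     print(accept(3, logits, k=2))  # True: token 3 in the top-2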
| buildbot wrote:
| It can be exact or not! Depends on the kind of sampling you
| are doing.
|
| You can do exact verification, and as soon as a token
| mismatches you reject everything after that token from your
| draft. Relaxed acceptance techniques measure how wrong that
| mispredicted token is via some metric, and accept it if
| it's close enough. So you get longer draft lengths with
| higher acceptance rates.
| gkapur wrote:
| Adding to the prior comments as my intuition matched yours,
| there's a nice Reddit thread that gives some context into
| how it can be faster even if you require exact matches:
| https://www.reddit.com/r/LocalLLaMA/s/ARxHLqRjdM
|
| The TLDR/key (from my understanding) is that verifying N
| tokens can be faster than generating N tokens.
| sailingparrot wrote:
| > The TLDR/key (from my understanding) is that verifying
| N tokens can be faster than generating N tokens.
|
| Yes. This is because to generate token n+1 you need token n,
| etc., so generating from scratch is a sequential (and thus
| slow) process. When we verify tokens, we can, for each token,
| use all preceding tokens as input and check that the output
| token matches the expectation. Since the full sequence we
| want to verify already exists, we can do this for every
| token in parallel rather than sequentially.
|
| This is why training transformer models is much faster than
| training RNNs: we do the same thing during training, it's
| just that the sequence we compare against is the ground
| truth rather than the output of another model.
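|
| Concretely: one forward pass over (context + draft) gives the
| target's next-token prediction at every position, so checking
| the whole draft is just indexing (numpy placeholders below;
| real logits would come from the target model):
|
|     import numpy as np
|
|     def n_accepted(target_logits: np.ndarray,
|                    n_ctx: int, draft: list) -> int:
|         # target_logits: [seq_len, vocab] from ONE forward
|         # pass over context + drafted tokens. Logits at
|         # position p predict token p+1, so every drafted
|         # token is checked against the target's greedy pick.
|         preds = target_logits.argmax(axis=-1)
|         for i, tok in enumerate(draft):
|             if preds[n_ctx - 1 + i] != tok:
|                 return i       # first mismatch: stop here
|         return len(draft)
|
|     # 4 context tokens + 2 drafted tokens, vocab size 5.
|     logits = np.random.default_rng(0).normal(size=(6, 5))
|     draft = [int(logits[3].argmax()), int(logits[4].argmax())]
|     print(n_accepted(logits, n_ctx=4, draft=draft))  # -> 2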
| wishawa wrote:
| I didn't know this! I've always thought speculative decoding
| was "if p(draft_token) > threshold, use it". You made me go
| read how it actually works and it's pretty neat!
|
| That said, I still think some providers are cheating. Please
| correct me if the test below is flawed.
|
| I generated texts at temperature = 0 vs temperature = 2. At
| high temperature, the distributions effectively become
| flatter, meaning the difference between real and draft
| effective distributions (the D_LK used in theorem 3.5 of
| 2211.17192) becomes smaller. When T=2, the model speaks
| complete gibberish, so the effective distribution must be
| pretty flat. This should mean fewer rejections --> a lot
| faster speculative decoding. Yet, I see no increase in
| throughput at all...
| sailingparrot wrote:
| Not sure exactly what setup you are running, but in theory
| yes: a higher temperature for both models means a higher
| chance of overlap and thus fewer rejections -> faster
| sampling (but worse quality overall).
|
| However, if you have a higher temperature but are still
| operating under top-k sampling where k is small, I'm not
| sure it's going to translate into any noticeable
| difference, since that keeps your actual distributions
| very much non-uniform.
| wishawa wrote:
| This is with Together's API via OpenRouter, running
| DeepSeek V3 0324 and Kimi K2 0905.
|
| I didn't set a top-k. So it seems like Together must be
| doing something weird in their speculative decoding
| implementation.
| sailingparrot wrote:
| Oh, in that case there is definitely a top-k or top-p
| behind the scenes; it might just not be exposed to the
| user as a param they can change through the API. I haven't
| heard of anyone running an LLM in prod with actual pure
| sampling.
| Havoc wrote:
| >a faster speculator (also known as the draft model) proposes
| multiple tokens ahead, and the target model verifies them in
| parallel in a single forward pass
|
| TIL. Bit of an aha moment - never understood till now how a big
| model can verify faster than it can generate
| woadwarrior01 wrote:
| As with almost everything else in CS, it's a tradeoff.
| Pre-fill is compute bound; decoding is memory bandwidth
| bound. Speculative decoding works when the draft model is
| more often right than wrong, because most architectures
| have a lot more compute than memory bandwidth.
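|
| A rough back-of-envelope for why that matters (all numbers
| below are assumed round figures, just to show the shape of
| the argument, not measured values):
|
|     # Single-stream decode has to stream the (active) weights
|     # from HBM for every token, so it is bandwidth limited.
|     bandwidth_gb_s = 8000   # assumed ~8 TB/s (B200-class HBM)
|     active_weights_gb = 37  # assumed FP8, ~37B active params
|
|     step_s = active_weights_gb / bandwidth_gb_s
|     print(f"~{1 / step_s:.0f} tok/s ceiling, 1 token per step")
|
|     # Verifying a k-token draft reuses that same weight read,
|     # so if about `a` of the k drafted tokens are accepted:
|     k, a = 4, 3
|     print(f"~{a / step_s:.0f} tok/s if {a} of {k} accepted")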
| andblac wrote:
| At first glance, this reminds me of how branch prediction is
| used in CPUs to speed up execution. As I understand it, this
| development is like a form of soft branch prediction over
| language trajectories: a small model predicts what the main
| model will do, runs a few steps ahead, and then the results
| are verified (and this can be done in parallel). If it checks
| out, you just jump forward; if not, you take a miss, but
| that's rare. I find it funny how small-big ideas like this
| come up in different contexts again and again in the history
| of our technological development. Of course ideas, as always,
| are cheap. The hard part is how to actually use them and cash
| in on them.
| red2awn wrote:
| A lot of optimizations in LLMs now are low-hanging fruit
| inspired by techniques from classical computer science.
| Another one that comes to mind is paged KV caching, which is
| based on memory paging.
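|
| A toy version of that analogy (block size and bookkeeping are
| illustrative, vLLM-style, not any provider's actual code):
|
|     BLOCK = 16  # tokens per KV block (assumed page size)
|
|     class PagedKV:
|         # Per-sequence "block tables" map logical token
|         # positions to fixed-size physical cache blocks,
|         # the way a page table maps pages to frames.
|         def __init__(self, n_blocks: int):
|             self.free = list(range(n_blocks))  # block pool
|             self.tables = {}                   # seq -> blocks
|
|         def slot(self, seq: int, pos: int):
|             table = self.tables.setdefault(seq, [])
|             if pos // BLOCK == len(table):  # need new block
|                 table.append(self.free.pop())
|             return table[pos // BLOCK], pos % BLOCK
|
|     kv = PagedKV(n_blocks=8)
|     print([kv.slot(0, p) for p in range(18)][-3:])
|     # -> [(7, 15), (6, 0), (6, 1)]: the 17th token spills
|     # into a second physical block.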
| LogicFailsMe wrote:
| No barrier to entry whatsoever? Backprop on the speculative
| decoding weights during inference to improve their accuracy on a
| per application basis?
|
| Cool hack though, kudos. Wonder if they can make Groq or Cerebras
| do the same thing?
| necovek wrote:
| So with a 4x speed-up, Together will give us at least 2x lower
| price for top-end models, right? :)
| jsutton97 wrote:
| I can't help but wonder how much longer we'll see this work
| shared openly.
| diamond559 wrote:
| Great, my slop memes can come out much faster now. This is the
| future of the world economy!
| hazrmard wrote:
| Do I understand this right?
|
| A light-weight speculative model adapts to usage, keeping the
| acceptance rate for the static heavy-weight model within
| acceptable bounds.
|
| Do they adapt with LoRAs?
___________________________________________________________________
(page generated 2025-10-12 23:00 UTC)