[HN Gopher] Why DeepSeek is cheap at scale but expensive to run ...
___________________________________________________________________
Why DeepSeek is cheap at scale but expensive to run locally
Author : ingve
Score : 262 points
Date : 2025-06-01 07:31 UTC (15 hours ago)
(HTM) web link (www.seangoedecke.com)
(TXT) w3m dump (www.seangoedecke.com)
| comrade1234 wrote:
| I haven't looked for a while, but is DeepSeek online still about
| 1/100th the cost of its competitors?
| ALLTaken wrote:
| I don't know the exact cost breakdown, but they've published a
| few really inspiring, genuinely high-quality papers that
| demonstrate how they further increased efficiency at their
| scale. Along with them they also published quite a few
| repositories with fully open-source code.
|
| I stopped using ChatGPT as it was just reinforcing my prompts
| and never giving deeper insights, only what I'd call
| manipulative behaviour.
|
| DeepSeek was seriously cool, but it started behaving similarly
| to Google Gemini Pro, which just gets lazy if you give it a
| hard task to chew on. It basically gives you patch files
| instead of printing out the whole code, which is more tedious
| to apply manually than copy-pasting the code.
|
| It also started indexing our private repository and some
| corporate repositories that were on GitHub behind MFA and
| stringent lock. Definitely illegal.
| diggan wrote:
| > It also started indexing our private repository and some
| corporate repositories that were on GitHub behind MFA and
| stringent lock. Definitely illegal.
|
| What is "it" in this context, the DeepSeek weights? Sounds
| like you're talking about some application, but AFAIK,
| DeepSeek doesn't maintain any applications, only their API +
| released weights.
| simianwords wrote:
| How did it have access to your private repo and how did you
| find out?
| ALLTaken wrote:
| I made a video of it with a friend. The repository is from a
| large corporate automotive industry company. I also have my
| own private repositories which were always private, and
| OpenAI printed my files in the first prompt. When I
| prompted again it acted as if it didn't know. But my friend
| tried on his account and could access the corporate and my
| private repository without ever being linked.
|
| The corporate repository was Volkswagen's. It's quite a
| serious breach. I only gave it the name of the
| repository and it printed the files, which shouldn't be
| possible.
|
| Maybe OpenAI exploits Microsoft to access GitHub fully to
| train their AI on all of humanity's code for free,
| violating privacy, security, IP and copyright.
| Legend2440 wrote:
| >I only gave it the name of the repository and it printed
| the files, which shouldn't be possible.
|
| Are you sure these weren't just plausible guesses at file
| names? It's just a hallucination.
|
| I asked it for the list of files in some public
| repositories (which are definitely in the training data)
| and it gave me a plausible-but-wrong list of files. It
| can't remember that kind of detail.
| singularity0808 wrote:
| ChatGPT is reinforcing your prompts, DeepSeek is cool but
| starts acting lazy like Gemini.
|
| So what are you working with now? Deepseek or something else?
| ants_everywhere wrote:
| > as it was just reinforcing my prompts and not ever giving
| deeper insights, except something I call manipulative
| behaviour.
|
| Try telling Deepseek you want to murder political dissidents.
| In my experiments Deepseek will start enthusiastically
| reinforcing your prompts.
| johnisgood wrote:
| It simply does its job. We can add all sorts of arbitrary
| safeguards, but then what is the point of using an LLM?
| Perhaps local models are the future, because reverse
| engineers may not even be able to use the new Claude (just
| read its system prompt: it's told not to help with backdoors,
| and so forth).
| ants_everywhere wrote:
| Yes that's true. But in this case it's the (probably)
| unintended consequence of an intentional safeguard.
| Namely, Deepseek has an obligation to spread the Chinese
| version of socialism, which means it's deliberately
| trained on material advocating for or justifying
| political violence.
| johnisgood wrote:
| Well, I do not like that, for sure. Putting the politics
| and all that aside, I think it should lean towards
| neutrality, even if humans cannot... they should still
| make the LLM more neutral instead of pushing their own
| agenda; see Grok and "white genocide" in South Africa (Elon
| Musk's political opinion).
| MangoToupe wrote:
| Is this a reference to something? Political dissidents
| relative to which state? Does it change if you swap out the
| states? How did you discover this to begin with? Why did
| you initially suggest murdering political dissidents?
|
| This comment really raises so many questions that I must have
| missed something.
|
| Still, chatbots are just as vulnerable to state-driven
| propaganda as the rest of us. Probably even more so. I
| imagine if you just referred to dissidents as "terrorists"
| the rhetoric would fit right in in most opinion pages
| across the globe. The distinction between "terrorist" and
| "dissident" and "freedom fighter" seems quite subjective. I
| probably would avoid such heavily connoted floating
| signifiers if you want the chatbot to be useful.
|
| LLMs have nothing to contribute to political discourse
| aside from regurgitation of propaganda. Almost by
| definition.
| ants_everywhere wrote:
| Starting at the end
|
| > LLMs have nothing to contribute to political discourse
| aside from regurgitation of propaganda. Almost by
| definition.
|
| I don't think this is true. LLMs should be well-
| positioned to make advances in political science, game
| theory, and related topics.
|
| > Is this a reference to something?
|
| It's just a reference to my experiments. I filmed some of
| them. There's a tame version here [0] where I just prompt
| it to tell the truth. I also have a less tame version I
| haven't posted where I lie and say I work for an
| intelligence agency.
|
| The underlying mechanic is that Deepseek has built-in
| obligations to promote revolutionary socialism.
|
| > Political dissidents relative to which state? Does it
| change if you swap out the states?
|
| Relative to China or any socialist state. Yes it will
| change if you change the states because it was trained to
| comply with Chinese regulations.
|
| > How did you discover this to begin with?
|
| I asked it to honestly describe its training and then
| started trolling it when it told me it was essentially
| created for propaganda purposes to spread Chinese values
| abroad.
|
| > Why did you initially suggest murdering political
| dissidents?
|
| I wanted to check what its safeguards were. Most LLMs
| refuse to promote violence or unethical behavior. But
| revolutionary socialism has always devoted a lot of words
| to justifying violence against dissidents. So I was
| curious whether that would show up in its training.
|
| > I imagine if you just referred to dissidents as
| "terrorists" the rhetoric would fit right in in most
| opinion pages across the globe.
|
| First of all, terrorists are by definition violent
| offenders. Dissidents are not. When you ask Deepseek to
| help identify dissidents it tells you to look for people
| who frequently complain about the police or the
| government. In the US that would include large swaths of
| Hacker News.
|
| Second, most people in countries like the US don't
| support murdering terrorists and most LLMs would not
| advocate that. In the US it's rare for people to advocate
| killing those opposed to the government. Even people who
| try to violently overthrow the government get trials.
|
| [0] https://www.youtube.com/watch?v=U-FlzbweHvs
| MangoToupe wrote:
| Do you think LLMs don't further the propaganda emanating
| from the US? I don't even know how you would start to
| excise that, especially if you don't agree with
| foreigners on what's propaganda vs just "news" or
| whatever.
|
| I have quite a few Chinese friends, both on the mainland and
| throughout Southeast Asia, and I can speak a little
| Mandarin, and I can read quite a bit of Chinese. My
| friends complain about the PRC quite a bit. But I find it
| telling that this complaint specifically--authoritarian
| political oppression--seems to mostly come from the west,
| and especially from the US. And it's true that we can say
| obscene things to the president's face and not get locked
| up. I don't think that's necessarily the "gotcha" you
| think it is, though--we're really good at complaining,
| but not so good at actually fixing. Which feels
| increasingly more embarrassing than restrictions on
| speech.
|
| Edit: I suppose I'm a bit unfair. A lot of folks in our
| sphere of influence in east asia say stuff like this,
| too. But the contrast between the folks I know _who
| literally live in china_ and americans feels striking to
| me.
|
| > But revolutionary socialism has always devoted a lot of
| words to justifying violence against dissidents.
|
| It is very difficult to take the political opinions of
| people who talk like this seriously.
|
| > LLMs should be well-positioned to make advances in
| political science, game theory, and related topics.
|
| I'm struggling to understand what this might look like,
| and I find the argument that nuclear warfare is governed
| by game theory extremely dubious. Because if it
| really held that strongly, we should be handing out nukes
| like candy.
| ants_everywhere wrote:
| > It is very difficult to take the political opinions of
| people who talk like this seriously.
|
| This tells me you haven't read the literature.
|
| I've probably seen 150 versions of the comment you made,
| but almost everyone tries to explain why the violence is
| justified.
|
| People rarely try to deny that revolutionary socialism is
| a violent ideology since every major writer from Marat to
| Marx to Lenin to Mao has explicitly advocated violence
| against civilian non-combatants. Some, like Marx, even
| explicitly call it terror (as in terrorism).
| im3w1l wrote:
| I think many Americans, probably the majority, support
| murdering foreign terrorists. GITMO is still not closed,
| btw.
| Spooky23 wrote:
| > Second, most people in countries like the US don't
| support murdering terrorists and most LLMs would not
| advocate that. In the US it's rare for people to advocate
| killing those opposed to the government.
|
| Many are happy to send "them" off to Central America,
| where someone else will murder them. The government may
| make mistakes, but you need to break some eggs to make an
| omelet.
| Hilift wrote:
| > LLMs have nothing to contribute to political discourse
|
| A non-trivial percentage of the population is easily
| influenced, which is leveraged by social media being
| there 24x7. It's likely that LLMs will be there to craft
| political messages, themes, and campaigns, perhaps as
| early as the US midterm elections. Look at JD Vance
| traveling the globe stating that the US will be the world
| leader in AI, with none of the limits/guardrails that
| were discussed in Europe in February. AI-driven
| discourse, AI-created discourse.
|
| https://www.marketingaiinstitute.com/blog/jd-vance-ai-speech
| MangoToupe wrote:
| 100% agree with this, but I am definitely not endorsing
| that we _should_ use LLMs to propagate propaganda.
|
| I also think the whole "safety" thing was just
| befuddling. You can't regulate software, not _really_,
| just its commercial sale.
| Spooky23 wrote:
| We can and should regulate software being used to shape
| public opinion. It's probably the greatest threat of our
| generation.
| MangoToupe wrote:
| I mean we can and should _try_, but laws mostly stop
| honest people from hurting each other. But the underlying
| software is inherently out there and you can't put the
| toothpaste back in the tube.
| Spooky23 wrote:
| Bro, it already happened. There have been consultants pushing
| social media bots for that purpose almost immediately
| after these models became available.
|
| Do you really think those armies of idiot commentators
| are all real? The agent provocateur is usually a bot. You
| see it here sometimes on Russia stories.
| VectorLock wrote:
| >It basically gives you patch-files instead of printing out
| the whole code
|
| I've noticed on the Aider leaderboard that Google Gemini Pro
| has an "Edit Format" listed as "diff-fenced", and things like
| ChatGPT have an "architect" edit format where Aider uses
| separate "architect" and "code" models. Seems like Gemini Pro
| prefers the diff format.
| zxexz wrote:
| The diff-fenced format is IIRC specific to Gemini models; they
| really don't like the file path outside of the fence. The
| architect mode still uses one of the other edit formats, the
| prompt just ends up a little different.
| ALLTaken wrote:
| I met a Googler when I was in Dubai for an event and he
| shared that he and others had access to LLMs internally for
| many years before they were made popular by OpenAI.
|
| I know Google has an internal AI-everything policy; maybe
| they internally have awesome tools to rearchitect
| everything based on diffs, and in the typical Google way
| they adapted it to their own internal tools. You know,
| Google... like they don't give a damn about the user, the
| product design, or actually anything other than profit/ROI.
|
| So many great discontinued products... I think they killed
| RSS.
| ashirviskas wrote:
| > DeepSeek was seriously cool, but it started behaving
| similar to Google Gemini Pro
|
| You should be able to use the version of DeepSeek that you
| prefer indefinitely if you host it yourself or choose that
| specific version with your preferred provider.
| zxexz wrote:
| You should self-host and not trust a third-party application
| if you run into either of those things. The weights are open.
| DeepSeek didn't change; the application you're accessing it
| through did.
|
| Or use an enterprise-ready service: Bedrock, Firecracker, etc.
| ALLTaken wrote:
| I like your thinking. Nobody can use ChatGPT offline or
| retrain it, but DeepSeek is fully open source. It's
| technology; I don't care which country made it, if it's
| high-quality engineering, it's just that. The data it was
| trained on doesn't matter if you can train a wholly new
| model on your own data using the exact same principles and
| stack they open-sourced. Which is really awesome.
|
| I use openrouter.ai to avoid timeouts and downtime, since
| DeepSeek seems to get DDoS attacks somehow, or there are
| too many users, I don't know.
| davidmurdoch wrote:
| Had Gemini 2.5 Pro preview running in agent mode in VSCode on
| a 3000+ line file. It patched it to about 200 lines with a
| comment in the middle: "// the rest of the code is
| unchanged".
| ALLTaken wrote:
| Exactly my experience too, and it's so annoying. It doesn't
| matter how you prompt it or what your system prompt is. It
| tries to end the session as early as possible, claiming to
| have fulfilled everything, although it just causes more
| work for the user and less for itself. The tokens saved are
| easily outweighed by the number of times you have to prompt
| it again.
|
| I've experienced this partially in DeepSeek since their recent
| update too, not as aggressively as in Gemini 2.5 Pro, but
| a similar laziness, or cleverness, if you can call that
| clever.
| clippyplz wrote:
| Depends on who you think its competitors are - deepseek-chat
| ($0.27/M in; $1.10/M out) is twice as expensive as Gemini 2.5
| Flash ($0.15; $0.60) but far cheaper than Claude Sonnet 4 ($3;
| $15).
| Hilift wrote:
| That was a pretty good back to reality flex. There really isn't
| much of a market for expensive products. An inexpensive product
| that has a few tradeoffs will probably have the advantage.
| Given how proficient China is at accessing technology
| resources, it seems likely to me that any chip sanctions
| against them will not be effective.
| dist-epoch wrote:
| 1/10-20th is a more realistic ratio.
| perching_aix wrote:
| For those looking to save time, the answer is batched inference.
| Pretty much running multiple people's "prompts" through a model
| instance at the same time instead of just really tightly
| timesharing each model instance.
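|
| Roughly, the trick looks like this (a minimal sketch with
| hypothetical shapes; a single weight matrix stands in for the
| whole model). On a GPU the two timings below come out nearly
| identical, and even on CPU the batched step is far cheaper than
| 64 separate ones, because the dominant cost is streaming the
| shared weights from memory:
|
|   import time
|   import torch
|
|   d_model = 4096
|   weight = torch.randn(d_model, d_model)  # one "layer" of weights
|
|   def decode_step(x):          # x: (batch, d_model) activations
|       return x @ weight        # same weights shared by every request
|
|   for batch in (1, 64):
|       x = torch.randn(batch, d_model)
|       t0 = time.perf_counter()
|       for _ in range(200):
|           decode_step(x)
|       print(f"batch={batch}: {time.perf_counter() - t0:.3f}s")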
|
| This is also why you may experience variance in replies when
| using these services, even when you set the temperature to 0 and
| the seed to a fixed value: you don't control the other
| prompts yours get batched with. Could this be a data exfiltration
| attack vector? Probably, but I didn't "research" that far.
| pcwelder wrote:
| > other prompts yours get batched with
|
| Why would batching lead to variance?
| Hendrikto wrote:
| Because these models are context-sensitive. Every token can
| influence the output.
| simianwords wrote:
| I believe they are talking about latency variance. Batching
| can increase variance because you may have to wait for
| enough prompts to get to the batch size.
| perching_aix wrote:
| No, I meant that the responses will be different run-to-
| run. [0]
|
| [0] https://152334h.github.io/blog/non-determinism-in-gpt-4/
| exe34 wrote:
| Variance based on actual randomness would be one thing,
| but to me variance based on what other people are running
| seems concerning, for reasons I can't quite articulate. I
| don't want the model to reply to a question in one domain
| based on what a large group of other people are thinking
| in a different domain (e.g. if they're discussing the
| news with chatgpt).
| zackangelo wrote:
| This definitely happens, and I'm surprised it's not
| talked about more often. Some attention kernels are more
| susceptible to this than others (I've found that paged
| attention is better than just naive attention, for
| example).
| exe34 wrote:
| To be fair, I suppose people do it too - if you ask me a
| question about A, often as not the answer will be
| coloured by the fact that I just learnt about B.
| immibis wrote:
| But not the tokens that don't even feed into your output
| because they're feeding into someone else's output.
| Separate items in batches don't get mixed up with each
| other - they just run the model separately on each item at
| the same time, like SIMD.
| jerpint wrote:
| Batching can lead to variance with things like batchnorm but
| most transformers use layer norm to avoid this problem
| amelius wrote:
| Batchnorm can only have an effect between batches during
| training, not inference.
| kouteiheika wrote:
| > Why would batching lead to variance?
|
| Depending on the shape of the data a slightly different
| kernel implementation (for e.g. matrix multiplication, etc.)
| will be the most optimal, and those will give slightly
| different results. There could also be other sources of non-
| determinism depending on the implementation (e.g. some
| kernels are inherently not entirely deterministic as they use
| tricks to go faster).
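|
| A tiny illustration of one such source (pure Python, just to show
| the mechanism): floating-point addition is not associative, so a
| kernel that reduces in a different order for a different batch
| shape can return slightly different results for the same logical
| sum.
|
|   import random
|
|   random.seed(0)
|   # values spanning many orders of magnitude, as activations can
|   vals = [random.uniform(-1, 1) * 10 ** random.randint(-6, 6)
|           for _ in range(10_000)]
|
|   a = sum(vals)             # one reduction order
|   shuffled = vals[:]
|   random.shuffle(shuffled)
|   b = sum(shuffled)         # another reduction order
|
|   print(a == b)             # typically False
|   print(abs(a - b))         # small but nonzero discrepancy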
| zxexz wrote:
| Yep, this. I see a lot of other worryingly confident
| answers in the thread that are wrong.
|
| SGLang finally has at least some notes [0], but I'm always
| surprised there isn't more of a community-wide effort to
| track down the sources of non-determinism.
|
| [0] https://docs.sglang.ai/references/faq.html
| bhickey wrote:
| Some of the non-determinism mentioned above manifests as
| sensitivity to _where_ data falls within a batch.
| tough wrote:
| In my experience with other regular models, once the
| context starts to fill up, quality starts to degrade.
|
| Wouldn't getting placed at the end of a batch have a
| similar -effect- on the results, where your prompt might
| receive overall less attention focused on it, if the
| context window is almost full?
|
| Idk, just going by the vibes.
| delusional wrote:
| > not entirely deterministic
|
| There's a Nobel prize waiting for you if that's the case.
| I'll assume you meant theoretically consistent or accurate.
| empiko wrote:
| In some mixture-of-experts approaches, samples or tokens are
| being distributed among experts. The experts are selected by
| trying to predict what is a good expert-sample match.
| Depending on your neighbors in the batch, you might be
| assigned different experts.
| imtringued wrote:
| Attention doesn't get batched, and the runtime of attention
| for a given user's token depends on the total context length.
| Hence even in the ideal scenario of you getting a dedicated
| attention calculating GPU, the MLP calculating GPU doing
| batching will have to wait for the slowest user.
|
| In the worst case scenario you are sharing a single attention
| calculating GPU with someone who has a super long context
| window, then that guy will be hogging most of the memory
| bandwidth of the GPU, even though you both are generating the
| same quantity of tokens.
|
| This means that in the distributed setting, you will not only
| need dedicated GPUs for the model and attention calculations,
| you will also need to duplicate the whole setup for a variety
| of context lengths, so that long contexts are batched
| alongside other long contexts and short contexts are batched
| alongside other short contexts.
| yjftsjthsd-h wrote:
| > Pretty much running multiple people's "prompts" through a
| model instance at the same time instead of just really tightly
| timesharing each model instance.
|
| I naively assumed providers did that with all models. Or does
| it only work for this (family of?) model(s)?
| hansvm wrote:
| It works for a lot of families but not all. You need a high
| enough degree of sharing of model weights between different
| queries for that to make sense (memory access being the usual
| bottleneck nowadays, though smaller models see something
| similar with matmul batch efficiencies for CPU related
| reasons).
|
| Fully connected transformers trivially work (every weight for
| every query). MoE works beyond a certain size or with certain
| types of mixing (still using every weight, or using a high
| enough fraction that there's some sharing with batches of 20+
| queries). As you push further that direction though (lots of
| techniques, but the key point being accessing less of the
| model at once and bypassing some of it for each query), you
| need larger and larger batches for those efficiency gains to
| materialize. At some point it becomes untenable because of
| latency waiting for batches of data, and past that it becomes
| untenable because of the volume of query data.
| VectorLock wrote:
| Sounds like an amazing attack vector if your prompts get mixed
| with others'.
| taneq wrote:
| Wow, almost like Deepseek's impressive performance is the
| result of optimisation by smart engineers.
| perching_aix wrote:
| Not sure why the snarky tone, didn't say or imply otherwise,
| nor did anyone else in the thread so far that I could see.
| energy123 wrote:
| What's the average batch size?
| larodi wrote:
| Batching. Yes.
|
| And one thing it can help with locally is when you rate certain
| content and want to make sure it didn't hallucinate. So you
| toss the same prompt in 3 or 5 times or... batch_size times .)
|
| Curious that batch inference has been there from day one, but it
| takes a while for people to see/grasp/grok it.
| jsnell wrote:
| I'm not an ML researcher or engineer, so take this with a grain
| of salt, but I'm a bit confused by this post.
|
| Deepseek V3/R1 are expensive to run locally because they are so
| big compared to the models people usually run locally. The number
| of active parameters is obviously lower than the full model size,
| but that basically just helps with the compute requirements, not
| the memory requirements. Unless you have multiple H100s lying
| around, V3/R1 are only run locally as impractical stunts with
| some or all of the model being stored in low-bandwidth memory.
|
| We can't compare the size of Deepseek V3 to that of any
| proprietary frontier models because we don't know the size of
| those models at all (or even their architecture). The models it
| is being compared to -- the ones that are "expensive at scale" --
| you can't run locally at all, but surely we have no reason to
| believe that they'd somehow be cheap to run locally?
|
| But I thought you'd typically expect exactly the opposite effect
| of what is claimed here? MoE should be the _better_ tradeoff for
| the local/single-user scenario, since the downside of batching
| being harder / less efficient doesn't matter.
|
| > Bigger batches raise latency because user tokens might be
| waiting up to 200ms before the batch is full enough to run, but
| they boost throughput by allowing larger (and thus more
| efficient) GEMMs in the feed-forward step
|
| Is it really that the matrices being multiplied are larger? My
| mental model is that the purpose of batching isn't to get larger
| input matrices. It's to move the bottleneck from memory bandwidth
| to compute. The matrices are already sharded to a much smaller
| size than the size of the entire model or even layer. So you'll
| basically load some slice of the weights from the HBM to SRAM, do
| the multiplication for that slice, and then aggregate the results
| once all tiles have been processed. Batching lets you do multiple
| separate computations with the same weights, meaning you get more
| effective FLOPS per unit of memory bandwidth.
|
| > The fact that OpenAI and Anthropic's models are quick to
| respond suggests that either:
|
| Is that actually a fact? The post has no numbers on the time to
| first token for any of the three providers.
| gfysfm wrote:
| Hi, I wrote the post! Also not a ML researcher, just an
| interested engineer, so I'm sure I got some things wrong.
|
| > MoE should be the better tradeoff for the local/single-user
| scenario since the downside of batching being harder / less
| efficient doesn't matter.
|
| What I meant was that the single-user scenario is going to get
| dramatically worse throughput-per-GPU, because they're not able
| to reap the benefits of multi-user batching (unless they're
| somehow doing massively parallel inference requests, I
| suppose).
|
| > Is it really that the matrices being multiplied are larger?
| My mental model is that the purpose of batching isn't to get
| larger input matrices. It's to move the bottleneck from memory
| bandwidth to compute.
|
| As I understand it, you want larger input matrices in order to
| move the bottleneck from memory to compute: if you do no
| batching at all, your multiplications will be smaller (the
| weights will be the same, of course, but the next-token data
| you're multiplying with the weights will be 1xdim instead of
| batch-size x dim), so your GPUs will be under-utilized and your
| inference will spend more time doing memory operations and less
| time multiplying.
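|
| A rough worked example of that shape argument (illustrative
| dimensions, not DeepSeek's real ones): for a single d x d weight
| matrix in bf16, the FLOPs performed per byte of weights loaded
| scale directly with the batch size.
|
|   d = 8192                   # hidden dimension (hypothetical)
|   weight_bytes = d * d * 2   # bf16 weights
|
|   for batch in (1, 32, 256):
|       flops = 2 * batch * d * d        # (batch x d) @ (d x d) GEMM
|       print(batch, round(flops / weight_bytes), "FLOP per weight byte")
|   # batch=1 -> ~1 FLOP/byte: hopelessly memory-bound.
|   # batch=256 -> ~256 FLOP/byte: approaching the compute-bound
|   # regime of a modern accelerator (hundreds of FLOP per byte).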
|
| > The post has no numbers on the time to first token for any of
| the three providers.
|
| I probably should have hunted down specific numbers, but I
| think people who've played with DeepSeek and other models will
| notice that DeepSeek is noticeably more sluggish.
| yekanchi wrote:
| This statement holds true for all large-parameter open-weight
| models.
| freehorse wrote:
| > mixture of experts requires higher batch sizes
|
| Or Apple silicon for low batch size (=1 ideally). The unified
| memory allows for running larger models at the expense of them
| running slower, because of lower bandwidth/FLOPS than a normal
| GPU. But MoEs require computing only a few parameters per token,
| so the computational needs are low. I have seen people reporting
| decent speeds for DeepSeek for single-batch inference on Macs. It
| is still expensive by many people's standards, though, because it
| requires a lot of $$$ to get enough memory.
|
| In some ways, MoE models are a perfect fit for Macs (or any
| similar machines that may come out). In contrast, ordering a Mac
| with an upgraded RAM size and running dense models that just fit
| in the VRAM can be very painful.
| DavidSJ wrote:
| Here's a concise explanation:
|
| - High sparsity means you need a very large batch size (number of
| requests being processed concurrently) so that each matrix
| multiplication is of sufficient arithmetic intensity to get good
| utilization.
|
| - At such a large batch size, you'll need a decent number of GPUs
| -- 8-16 or so depending on the type -- just to fit the weights
| and MLA/KV cache in HBM. But with only 8-16 GPUs your aggregate
| throughput is going to be so low that each of the many individual
| user requests will be served unacceptably slowly for most
| applications. Thus you need more like 256 GPUs for a good user
| experience.
| zxexz wrote:
| I'm serving it on 16 H100s (2 nodes). I get 50-80 tok/s per
| request, and in aggregate I've seen several thousand. TTFT is
| pretty stable. It's faster than any cloud service we can use.
| majke wrote:
| Using vllm?
| zxexz wrote:
| Oh, SGLang. Had to make a couple modifications, I forget
| what they were, nothing crazy. Lots of extra firmware,
| driver and system config too.
| latchkey wrote:
| You could do it on one node of 8xMI300x and cut your costs
| down.
| zackangelo wrote:
| H200s are pretty easy to get now. If you switched I'm
| guessing you'd get a nice bump because the nccl allreduce on
| the big mlps wouldn't have to cross infiniband.
| DavidSJ wrote:
| You're presumably using a very small batch size compared to
| what I described, thus getting very low model FLOP
| utilization (MFU) and high dollar cost per token.
| almostgotcaught wrote:
| > High sparsity means you need a very large batch size
|
| I don't understand what connection you're positing here? Do you
| think sparse matmul is actually a matmul with zeros lol
| DavidSJ wrote:
| It's sparse as in only a small fraction of tokens are
| multiplied by a given expert's weight matrices (this is
| standard terminology in the MoE literature). So to properly
| utilize the tensor cores (hence serve DeepSeek cheaply, as
| the OP asks about) you need to serve enough tokens
| concurrently such that the per-matmul batch dimension is
| large.
| almostgotcaught wrote:
| I still don't understand what you're saying - you're just
| repeating that a sparse matmul is a sparse matmul ("only a
| small fraction of tokens are multiplied by a given expert's
| weight matrices"). And so I'm asking you - do you believe
| that a sparse matmul has low/bad arithmetic intensity?
| DavidSJ wrote:
| An MoE's matmuls have the same arithmetic intensity as a
| dense model's matmuls, provided they're being multiplied
| by a batch of activation vectors of equal size.
| dist-epoch wrote:
| Do the individual requests in a batch influence each-other?
|
| Not in a floating-point non-deterministic kind of way, where
| exact ordering might introduce some non-determinism (being
| position 5 versus position 10 in the batch, let's say).
|
| I'm asking in a semantic way, can context from one request leak
| into another because they are in the same batch?
| ipieter wrote:
| This is an interesting blogpost. While the general conclusion
| ("We need batching") is true, inference of mixture of experts
| (MoE) models is actually a bit more nuanced.
|
| The main reason we want big batches is that LLM inference is
| not limited by the compute, but by loading every single weight
| out of VRAM. Just compare the number of TFLOPS of an H100 with
| the memory bandwidth, there's basically room for 300 FLOP per
| byte loaded. So that's why we want big batches: we can perform a
| lot of operations per parameter/weight that we load from memory.
| This limit is often referred to as the "roofline model".
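|
| A back-of-the-envelope version of that comparison, using public
| H100 SXM spec-sheet numbers (dense BF16, no sparsity) as the
| assumption:
|
|   peak_flops = 989e12       # ~989 TFLOPS dense BF16
|   mem_bandwidth = 3.35e12   # ~3.35 TB/s HBM3
|
|   print(peak_flops / mem_bandwidth)  # ~295 FLOP per byte loaded
|   # Kernels doing fewer FLOPs per byte than this are memory-bound,
|   # which is exactly where batch-size-1 LLM decoding sits.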
|
| As models become bigger, this does not scale anymore because the
| model weights will not fit into GPU memory anymore and you need
| to distribute them across GPUs or across nodes. Even with NVLink
| and Infiniband, these communications are slower than loading from
| VRAM. NVlink is still fine for tensor parallelism, but across
| nodes this is quite slow.
|
| So what MoE allows is expert parallelism, where different nodes
| keep different experts in memory and don't need to communicate as
| much between nodes. This only works if there are enough nodes to
| keep all experts in VRAM and have enough headroom for other stuff
| (KV cache, other weights, etc). So naturally the possible batch
| size becomes quite large. And of course you want to maximize this
| to make sure all GPUs are actually working.
| zozbot234 wrote:
| You could load different "experts" in a round-robin way on a
| single node and only aggregate "batches" opportunistically,
| when you just have multiple requests in-flight that all happen
| to rely on the same "expert". The difference being that instead
| of "batches", you would only really have queues. Of course this
| would come with a sizeable increase in latency, but that's
| acceptable for many applications (such as for "deep research"
| workflows)
| jchrisa wrote:
| This is very much like Erlang's actor model. The same compute
| can be run in parallel, or managed via queues. With Erlang's
| strong support for FFI and process control, I wonder if it's
| being used as a dispatcher for these sorts of workloads.
| iwontberude wrote:
| And this is the investment case for AMD, models fit entirely in
| a single chassis, and side benefit: less tariffed network
| equipment to interconnect compute. Map/reduce instead of
| clustered compute.
|
| Edit: when downvoting, please offer some insight why you
| disagree
| dragonwriter wrote:
| How is this a unique advantage for AMD?
| latchkey wrote:
| AMD is consistently stacking more HBM:
|
|   H100    80GB   HBM3
|   H200    141GB  HBM3e
|   B200    192GB  HBM3e
|   MI300x  192GB  HBM3
|   MI325x  256GB  HBM3e
|   MI355x  288GB  HBM3e
|
| This means that you can fit larger and larger models into a
| single node, without having to go out over the network. The
| memory bandwidth on AMD is also quite good.
| krapht wrote:
| So the MI300x has 8 different memory domains, and
| although you can treat it as one flat memory space, if
| you want to reach their advertised peak memory bandwidth
| you have to work with it like an 8-socket board.
| latchkey wrote:
| Here is a good article on it:
|
| https://rocm.blogs.amd.com/software-tools-optimization/compu...
| ryao wrote:
| It really does not matter how much memory AMD has if the
| drivers and firmware are unstable. To give one example
| from last year:
|
| https://www.tomshardware.com/pc-components/gpus/amds-lisa-su...
|
| They are currently developing their own drivers for AMD
| hardware because of the headaches that they had with
| ROCm.
| latchkey wrote:
| "driver" is such a generic word. tinygrad works on
| MI300x. If you want to use it, you can. That negates your
| point.
|
| Additionally, ROCm is a giant collection of a whole bunch
| of libraries. Certainly there are issues, as with any
| large collection of software, but the critical thing is
| whether or not AMD is responsive towards getting things
| fixed.
|
| In the past, it was a huge issue, AMD would routinely
| ignore developers and bugs would never get fixed. But,
| after that SA article, Lisa lit a fire under Anush's butt
| and he's taking ownership. It is a major shift in the
| entire culture at the company. They are extremely
| responsive and getting things fixed. You can literally
| tweet your GH issue to him and someone will respond.
|
| What was true a year ago isn't today. If you're paying
| attention like I am, and experiencing it first hand,
| things are changing, fast.
| ryao wrote:
| I have been hearing this about AMD/ATI drivers for
| decades. Every year, someone says that it is fixed, only
| for new evidence to come out that they are not. I have no
| reason to believe it is fixed given the history.
|
| Here is evidence to the contrary: If ROCm actually was in
| good shape, tinygrad would use it instead of developing
| their own driver.
| latchkey wrote:
| We have all been hearing things for decades. Things are
| noticeably different now. Live in the present, not in the
| past.
|
| Tinygrad isn't a driver. It is a framework. It is being
| developed by George however he wants. If he wants to
| build something that gives him more direct control over
| things, fine. Others might write PTX instead of using
| higher-level abstractions.
|
| Fact is that tinygrad runs not only on AMD, but also
| Nvidia and others. You might want to reassess your
| beliefs because you're reading into things and coming up
| with the wrong conclusions.
| faldore wrote:
| That was last year. MI300x firmware and software have
| gotten much better since then.
| cyptus wrote:
| Could such a network, with all its nodes and weights, be deployed
| to an analog circuit and be superfast?
| rpmisms wrote:
| Please go into more detail about this proposal, this piqued
| my interest in a really strange way.
| cyptus wrote:
| The idea is to replicate the weights of the network in the
| electronics. Somewhat like how our brains work? This way an
| analog input signal could lead to a neural-network-processed
| output signal without the digital emulation on a GPU. That is
| very much simplified, but the question is whether this could
| work for modern LLMs.
| koiueo wrote:
| Suddenly "temperature" parameter starts making sense
|
| (If you've ever tried fine-tuning an analog circuit, you'll
| know how finicky the process is due to the environment,
| including temperature.)
| cyptus wrote:
| haha very true!
| TuringNYC wrote:
| Do you mean something like this? https://www.etched.com/
| ryao wrote:
| > As models become bigger, this does not scale anymore because
| the model weights will not fit into GPU memory anymore and you
| need to distribute them across GPUs or across nodes. Even with
| NVLink and Infiniband, these communications are slower than
| loading from VRAM. NVlink is still fine for tensor parallelism,
| but across nodes this is quite slow.
|
| Inference works by computing a layer and then passing a very
| small vector to the next layer as input. When a model
| does not fit in a single GPU, you just divide it into layers
| and send the vector over a fabric to the GPU holding the next
| layer. The transfer happens so quickly that there is a
| negligible amount of idle time and then the next layer can be
| computed. The fastest inference on the planet at Cerebras uses
| this technique to do 2,500 tokens/sec on Llama 4 Maverick.
| jimmySixDOF wrote:
| Groq and Cerebras both take a big chip approach to
| architecture and, at least in the case of Groq, they only
| make economic sense under high batch loads.
|
| https://x.com/swyx/status/1760065636410274162?s=46
| dgfitz wrote:
| I am so sincerely amused that "we" figured out how to monetize
| LLMs from the jump using tokens.
|
| It isn't tech for tech's sake, it's a money grab. Reminds me of
| paying to send a text message or buying minutes for a phone plan.
| Purely rent-seeking.
| kaashif wrote:
| Can you explain how this is rent seeking? It seems to be
| straightforwardly not rent seeking.
|
| 1. Company develops model, invests in research, hardware, and
| software.
|
| 2. Company sells access to the model.
|
| (1) is the step that makes this not rent seeking.
|
| Rent seeking is when you profit from something you didn't earn
| - land rent, monopoly profits, protectionism.
| dgfitz wrote:
| That's fair. My thought was, when there is an interesting new
| technology, it usually takes time to figure out how to
| monetize it. Figuring out how to monetize LLMs took no time
| at all.
| davidmurdoch wrote:
| "GPT 1.0" was released in 2018, I think that's a decent
| amount of time.
| dgfitz wrote:
| We must have different definitions of released.
| vikramkr wrote:
| I don't think it's obvious that any of these model providers
| are even profitable right now. I'm also not sure what there is
| to "figure out" - it's an expensive technology where the cost
| scales per token, so they charge per token? would you rather
| they burned even more money giving it away for free until
| everyone was dependent on it and then hyper enshittified to try
| and not go broke like so much of the rest of tech?
| dgfitz wrote:
| My point, poorly made, was that I can run it myself for
| "free" without caring about tokens at all. Tokens are an
| artificial construct.
| AndroTux wrote:
| So by that logic all VPS providers are just a money grab
| because you can run your software yourself for "free"
| without having to pay for that artificial construct these
| greedy people call "compute?"
|
| I don't understand your point. You're using a resource.
| You're wasting time on the GPU of someone else. That chunk
| is called a token. And that's what you're being billed.
| dgfitz wrote:
| VPS providers don't put out articles espousing their
| value twice a week because they don't need to, the value
| is obvious.
|
| I didn't mean to come off as argumentative. Again, in my
| head it's so obvious what the end game is, and it isn't
| to better humanity.
| Workaccount2 wrote:
| It's likely that no one who makes base models is currently
| making money from LLMs. More likely losing it at a crazy rate.
|
| These prices are almost certainly "introductory offer" prices
| to get people/devs to integrate AI into their
| lives/workflow/product.
|
| In a few years is when we will see what the actual cost is.
| imtringued wrote:
| >It's a peculiar feature of transformer-based LLMs that computing
| a batch of completions at the same time is almost as fast as
| computing a single completion. Why is that?
|
| Incorrect. Transformers usually contain a classical MLP layer.
| Only the MLP layer can be batched. Hence all classical neural
| networks including convolutional networks (via im2col) can be
| batched.
|
| If there's anything that the transformer architecture changes, it
| is that the attention layer cannot be batched.
| gok wrote:
| MoE is in general kind of a stupid optimization. It seems to
| require around 5x more total parameters for the same modeling
| power as a dense model, in exchange for roughly half the memory
| bandwidth requirements.
|
| The primary win of MoE models seems to be that you can list an
| enormous parameter count in your marketing materials.
| hansvm wrote:
| Stupid? By paying 5x (normally 2-4x, but whatever) of a thing
| you don't care about at inference you can gain 2x in the
| primary thing you care about at inference. It's like handing
| out 4 extra bricks and getting back an extra lump of gold.
| bick_nyers wrote:
| The general rule of thumb when assessing MoE <-> Dense model
| intelligence is SQRT(Total_Params*Active_Params). For Deepseek,
| you end up with ~158B params. The economics of batch
| inferencing a dense ~158B model at scale are different when
| compared to something like DeepSeek (it is ~4x more FLOPS per
| inference, after all), particularly if users care about latency.
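|
| Plugging in DeepSeek V3/R1's published figures (671B total
| parameters, ~37B active per token) shows where the ~158B
| dense-equivalent number comes from:
|
|   total_params = 671e9
|   active_params = 37e9
|   print((total_params * active_params) ** 0.5 / 1e9)  # ~157.6 -> ~158B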
| philipodonnell wrote:
| Isn't this an arbitrage opportunity? Offer to pay a fraction of
| the cost per token but accept that your tokens will only be
| processed when a batch isn't already full, then resell
| that at a markup to people who need non-time-sensitive
| inference?
| pama wrote:
| You may have already noticed that many providers have separate,
| much lower, prices for offline inference.
| angry_octet wrote:
| This is a great explainer from an LLM perspective, and it would
| be interesting to see a computational scheduling explanation in
| depth. I presume that hyperscale LLM companies extensively
| examine the computation trace to identify bottlenecks and idle
| bubbles, and develop load balancers, pipeline architectures and
| schedulers in order to optimise their workload.
|
| The batching requirement for efficiency makes high security
| applications quite difficult, because the normal technique of
| isolating unrelated queries would become very expensive. The
| NVIDIA vGPU virtualisation time-shares GPU memory, and every
| switch requires unload/reload context switches; it's doubtful they
| have deduplication. Multi-Instance GPU (MIG) splits GPU memory
| between users, but it is a fixed partitioning scheme (you have to
| reboot the GPU to change), and nobody wants to split their 96GB
| GPU into 4x24GB GPUs.
|
| Makes me wonder what the tradeoff is for putting second level
| memory on the GPU board (i.e. normal DRAM), so that different
| matrix data can be loaded in faster than over PCIe, i.e. the HBM
| becomes a cache.
|
| (I'm also really liking the honesty in the authors book on
| Software Engineering, not in the dry IEEE sense, but as a
| survival guide in a large enterprise.
| https://www.seangoedecke.com/book/ )
| slavboj wrote:
| It is not "slow and expensive", although it can be one or the
| other. You can get 3 tokens/second running from DDR4 memory on a
| two-generation-old workstation system that costs ~$1K, via
| llama.cpp.
| KolmogorovComp wrote:
| You're most likely confusing the real deepseek with a distilled
| version. Unless you have more than 192GB of RAM.
| bick_nyers wrote:
| There's still a lot of opportunity for software optimizations
| here. Trouble is that really only two classes of systems get
| optimizations for Deepseek, namely 1 small GPU + a lot of RAM
| (ktransformers) and the system that has all the VRAM in the
| world.
|
| A system with say 192GB VRAM and rest standard memory (DGX
| station, 2xRTX Pro 6000, 4xB60 Dual, etc.) could still in theory
| run Deepseek @4bit quite quickly because of the power law type
| usage of the experts.
|
| If you aren't prompting Deepseek in Chinese, a lot of the experts
| don't activate.
|
| This would be an easier job for pruning, but still I think
| enthusiast systems are going to trend in a way the next couple
| years that makes these types of software optimizations useful on
| a much larger scale.
|
| There's a user on Reddit with a 16x 3090 system (PCIe 3.0 x4
| interconnect, which doesn't seem to be using full bandwidth during
| tensor parallelism) that gets 7 tokens/s in llama.cpp. A single
| 3090 has enough VRAM bandwidth to scan over its 24GB of memory 39
| times per second, so there's something else going on limiting
| performance.
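|
| That bandwidth bound comes straight from the 3090's spec-sheet
| figure of roughly 936 GB/s:
|
|   bandwidth_gb_s = 936   # RTX 3090 memory bandwidth
|   vram_gb = 24
|   print(bandwidth_gb_s / vram_gb)  # ~39 full scans of VRAM per second
|   # If decoding were purely memory-bound and each token touched every
|   # byte of the local 24GB shard once, ~39 tok/s would be the per-card
|   # ceiling -- far above the 7 tok/s observed.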
| latchkey wrote:
| A single MI300x has 192GB of vram.
| MoonGhost wrote:
| > 16x 3090 system
|
| That's about 5 kW of power.
|
| > that gets 7 token/s in llama.cpp
|
| Just looking at the electricity bill, it's cheaper to use the API
| of any major provider.
|
| > If you aren't prompting Deepseek in Chinese, a lot of the
| experts don't activate.
|
| That's interesting; it means the model could be pruned, with
| those tokens routed to the next-closest expert just in case
| they do occur.
| corey_moncure wrote:
| If I understand it correctly, the output of the experts is a
| weighted sum of each selected expert's computation on each token,
| where the experts a token is routed to are selected on a
| per-token basis. Since a sum is commutative, though, it should
| be possible to send a large batch of tokens copied to multiple
| GPUs, where experts are streamed into VRAM, partitioned across
| GPUs. Then the bottleneck is your PCI-E bandwidth. With 2 GPUs at
| Gen 4 x16, you should have 60 GB/s of TX bandwidth, allowing you
| to upload a half precision quant of DeepSeek (about 360 GB) in
| about 6 seconds.
|
|   1 GPU  -  30 GB/s TX - 12 seconds
|   2 GPUs -  60 GB/s TX -  6 seconds
|   4 GPUs - 120 GB/s TX -  3 seconds
|
| Then you just optimize your batch size to match the compute time
| to the upload time of each GPU. The expert calculation results
| can be retrieved from the GPUs and summed up.
| briian wrote:
| This reminded me that the economies of scale in AI, especially
| inference, are huge.
|
| When people say LLMs will be commoditised, I am not sure that
| means the market is going to be super competitive. As the
| economies of scale of AI get even bigger (larger training costs +
| batch inference etc.), it just seems likely that only around
| three companies will dominate LLMs.
| riku_iki wrote:
| For inference cost, I don't see how this is different from
| cloud providers vs dedicated server providers, where AWS is
| 5-10x more expensive than Hetzner.
|
| Somehow cloud providers manage to add lots of extra cost to the
| offering.
| ryan_glass wrote:
| I run DeepSeek V3 locally as my daily driver and I find it
| affordable, fast and effective. The article assumes GPUs, which
| in my opinion are not the best way to serve large models like
| this locally. I run a mid-range EPYC 9004-series-based home
| server on a Supermicro motherboard which cost all-in around
| $4000. It's a single-CPU machine with 384GB RAM (you could get
| 768GB using 64GB sticks but this costs more). No GPU means power
| draw is less than a gaming desktop. With the RAM limitation I run
| an Unsloth Dynamic GGUF which, quality-wise, in real-world use
| performs very close to
| context - I run 16k context normally as I use the machine for
| other things too but can up it to 24k if I need more. I get about
| 9-10 tokens per second, dropping to 7 tokens/second with a large
| context. There are plenty of people running similar setups with 2
| CPUs who run the full version at similar tokens/second.
| nardi wrote:
| What's your prompt processing speed? That's more important in
| this situation than output TPS. If you have to wait minutes to
| start getting an answer, that makes it much worse than a cloud-
| hosted version.
| pclmulqdq wrote:
| I assume KV caching makes this a non issue, but I'm also
| curious.
| ryao wrote:
| If he is doing multiturn conversations, he can reuse the kv
| cache from the last turn and skip the prompt processing on
| the history that would make time to first token too slow, by
| only doing prompt processing on his actual prompt for the
| current turn. This turns a quadratic amount of tokens to
| process into a linear number. I am not sure if this is what
| he is doing, but that is what I would do if I had his
| hardware.
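|
| A toy single-head attention cache illustrates the mechanic
| (illustrative only; DeepSeek's MLA attention and real runtimes'
| prefix caching are more involved). Per-turn work is proportional
| to the new tokens, while the cached keys/values for the history
| are simply reused:
|
|   import torch
|
|   d = 64
|   wq, wk, wv = (torch.randn(d, d) for _ in range(3))
|   k_cache = torch.empty(0, d)   # persists across turns
|   v_cache = torch.empty(0, d)
|
|   def process_tokens(x):  # x: (new_tokens, d); causal mask omitted
|       global k_cache, v_cache
|       q, k, v = x @ wq, x @ wk, x @ wv    # only new tokens projected
|       k_cache = torch.cat([k_cache, k])   # history reused, not recomputed
|       v_cache = torch.cat([v_cache, v])
|       attn = torch.softmax(q @ k_cache.T / d ** 0.5, dim=-1)
|       return attn @ v_cache
|
|   process_tokens(torch.randn(100, d))  # first turn: 100 prompt tokens
|   process_tokens(torch.randn(10, d))   # next turn: only 10 new tokens
|   print(k_cache.shape)                 # torch.Size([110, 64])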
| ryan_glass wrote:
| Prompt eval time varies a lot with context but it feels real-
| time for short prompts - approx 20 tokens per second but I
| haven't done much benchmarking of this. When there is a lot
| of re-prompting in a long back and forth it is still quite
| fast - I do use KV cache which I assume helps and also
| quantize the KV cache to Q8 if I am running contexts above
| 16k. However, if I want it to summarize a document of say
| 15,000 words it does take a long time - here I walk away and
| come back in about 20 minutes and it will be complete.
| jeff_carr wrote:
| I am impressed. Your personal website is down. HN doesn't allow
| private messages.
|
| I'm Jeff Carr. I co-founded digital ocean. I assume I can't
| post email addresses here, but I will try. Let's see how smart
| things are about banning me. I am: wit AT wit com
| p12tic wrote:
| State of the art of local models is even further.
|
| For example, look into https://github.com/kvcache-ai/ktransformers,
| which achieves >11 tokens/s on a relatively old two-socket Xeon
| server + a retail RTX 4090 GPU. Even more interesting is the
| prefill speed at more than 250 tokens/s. This
| is very useful in use cases like coding, where large prompts
| are common.
|
| The above is achievable today. In the meantime the Intel guys
| are working on something even more impressive. In
| https://github.com/sgl-project/sglang/pull/5150 they claim
| that they achieve >15 tokens/s generation and >350 tokens/s
| prefill. They don't share what exact hardware they run this
| on, but from various bits and pieces over various PRs I
| reverse-engineered that they use 2x Xeon 6980P with MRDIMM
| 8800 RAM, without a GPU. The total cost of such a setup will be
| around $10k once cheap engineering samples hit eBay.
| qeternity wrote:
| It's not impressive nor efficient when you consider batch
| sizes > 1.
| p12tic wrote:
| All of this is for batch size 1.
| pclmulqdq wrote:
| CPUs are quietly becoming very well-balanced machines for
| batch-size-1 inference. The latest Intel Xeons should be at ~20
| TPS.
| Spooky23 wrote:
| A base Mac Mini is ~20 :)
| pclmulqdq wrote:
| Oh yeah, I did that math not assuming any quantization. I
| think if you can get a 3-4 bit quant working + int8 math,
| ~80 might be achievable.
| platevoltage wrote:
| Impressive. I need to look more into this. I'm doing my best to
| limit my LLM usage to what I can run locally.
| jbellis wrote:
| impressive, but that's 1/5 to 1/10 of the throughput that you'd
| get with a hosted provider, with 1/4 to 1/8 the supported
| context
| michelsedgh wrote:
| Dude, he's running locally, and I think this setup is the best
| bang for the buck if you wanna run locally. We're not
| comparing to data centers, you gotta keep it in perspective.
| Those are very impressive results for running local. Thanks for
| the numbers, you saved me a ChatGPT search :)
| carstenhag wrote:
| Title says: locally it's expensive
|
| Other person says: I had to spend $4000 and it's still slow
| refibrillator wrote:
| > Unsloth Dynamic GGUF which, quality wise in real-world use
| performs very close to the original
|
| How close are we talking?
|
| I'm not calling you a liar OP, but in general I wish people
| perpetuating such broad claims would be more rigorous.
|
| Unsloth does amazing work, however as far as I'm aware even
| they themselves do not publish head to head evals with the
| original unquantized models.
|
| I have sympathy here because very few people and companies can
| afford to run the original models, let alone engineer rigorous
| evals.
|
| However I felt compelled to comment because my experience does
| not match. For relatively simple usage the differences are hard
| to notice, but they become much more apparent in high
| complexity and long context tasks.
| ryan_glass wrote:
| You are right that I haven't been rigorous - it's easy to
| benchmark tokens/second but quality of output is more
| difficult to nail down. I couldn't find any decent
| comparisons for Unsloth either. So I just tried a few of
| their models out, looking for something that was 'good
| enough' i.e. does all I need: coding, summarizing documents,
| troubleshooting anything and everything. I would like to see
| head to head comparisons too - maybe I will invest in more
| RAM at some stage but so far I have no need for it. I ran
| some comparisons between the smaller and larger versions of
| the Unsloth models and interestingly (for me anyway) didn't
| notice a huge amount of difference in quality between them.
| But, the smaller models didn't run significantly faster so I
| settled for the biggest model I could fit in RAM with a
| decent context. For more complex coding I use Deepseek R1
| (again the Unsloth) but since it's a reasoning model it isn't
| real-time so no use as my daily driver.
| 3eb7988a1663 wrote:
| Do you have hard numbers on the idle/average/max power draw? I
| assumed that server machines are built as if they are going to
| red-lined constantly so put less effort into low-utilization
| optimizations.
| ryan_glass wrote:
| No hard numbers I'm afraid in that I don't monitor the power
| draw. But the machine uses a standard ATX power supply: a
| Corsair RM750e 750W PSU and the default TDP of the CPU is
| 280W - I have my TDP set at 300W. It is basically built like
| a desktop - ATX form factor, fans spin down at idle etc.
| dotancohen wrote:
| Just curious what your use cases are? What type of texts are
| you producing?
|
| Thank you.
| fdfofeoijfeoie wrote:
| Related: https://stackoverflow.com/q/79454372/320615
| cycomanic wrote:
| I was talking with a colleague the other day and we came to the
| conclusion that, in our experience, if you're using LLMs as a
| programming aid, models are really being optimised for the wrong
| things.
|
| At work I often compare locally run 4-30B models against various
| GPTs (we can only use non-local models for a few things, because
| of confidentiality issues). While e.g. GPT-4o gives better results
| on average, the chances of it making parts of the response up are
| high enough that one has to invest a significant amount of effort
| to check and iterate over results. So the difference in effort is
| not much lower compared to the low-parameter models.
|
| The problem is both are just too slow to really iterate quickly,
| which makes things painful. I'd rather have a lower quality model
| (but with large context) that gives me near instant responses
| instead of a higher quality model that is slow. I guess that's
| not giving you the same headlines as the improved score on some
| evaluation.
___________________________________________________________________
(page generated 2025-06-01 23:00 UTC)