[HN Gopher] Why DeepSeek is cheap at scale but expensive to run ...
       ___________________________________________________________________
        
       Why DeepSeek is cheap at scale but expensive to run locally
        
       Author : ingve
       Score  : 262 points
       Date   : 2025-06-01 07:31 UTC (15 hours ago)
        
 (HTM) web link (www.seangoedecke.com)
 (TXT) w3m dump (www.seangoedecke.com)
        
       | comrade1234 wrote:
        | I haven't looked for a while, but is deepseek online still
        | about 1/100th the cost of its competitors?
        
         | ALLTaken wrote:
         | I don't know the exact cost-breakdown, but they've come up with
         | a few really inspiring and qualitatively high value papers that
         | demonstrate how they further increased efficiency at their
         | scale. Along with it they also published quite a few
         | repositories with fully open-source code.
         | 
         | I stopped using ChatGPT as it was just reinforcing my prompts
         | and not ever giving deeper insights, except something I call
         | manipulative behaviour.
         | 
          | DeepSeek was seriously cool, but it started behaving similarly
          | to Google Gemini Pro, which just tries to be lazy if you give
          | it a hard task to chew on. It basically gives you patch files
          | instead of printing out the whole code, which is more tedious
          | to apply manually than copy/pasting the code.
         | 
         | It also started indexing our private repository and some
         | corporate repositories that were on GitHub behind MFA and
         | stringent lock. Definitely illegal.
        
           | diggan wrote:
           | > It also started indexing our private repository and some
           | corporate repositories that were on GitHub behind MFA and
           | stringent lock. Definitely illegal.
           | 
           | What is "it" in this context, the DeepSeek weights? Sounds
           | like you're talking about some application, but AFAIK,
           | DeepSeek doesn't maintain any applications, only their API +
           | released weights.
        
           | simianwords wrote:
           | How did it have access to your private repo and how did you
           | find out?
        
             | ALLTaken wrote:
             | I made a video of it with a friend. The repository is of a
              | large corporate automotive industry company. I also have my
             | own private repositories which were always private and
             | OpenAI printed my files in the first prompt. When I
             | prompted again it acted as if it didn't know. But my friend
             | tried on his account and could access the Corp and my
             | private repository without ever being linked.
             | 
              | The corporate repository was Volkswagen's. It's quite a
              | serious breach. I only gave it the name of the
             | repository and it printed the files, which shouldn't be
             | possible.
             | 
             | Maybe OpenAI exploits Microsoft to access GitHub fully to
             | train their AI on all of humanity's code for free,
             | violating privacy, security, IP and copyright.
        
               | Legend2440 wrote:
               | >I only gave it the name of the repository and it printed
               | the files, which shouldn't be possible.
               | 
               | Are you sure these weren't just plausible guesses at file
               | names? It's just a hallucination.
               | 
               | I asked it for the list of files in some public
               | repositories (which are definitely in the training data)
               | and it gave me a plausible-but-wrong list of files. It
               | can't remember that kind of detail.
        
           | singularity0808 wrote:
           | ChatGPT is reinforcing your prompts, DeepSeek is cool but
           | starts acting lazy like Gemini.
           | 
           | So what are you working with now? Deepseek or something else?
        
           | ants_everywhere wrote:
           | > as it was just reinforcing my prompts and not ever giving
           | deeper insights, except something I call manipulative
           | behaviour.
           | 
           | Try telling Deepseek you want to murder political dissidents.
           | In my experiments Deepseek will start enthusiastically
           | reinforcing your prompts.
        
             | johnisgood wrote:
              | It simply does its job. We can add all sorts of arbitrary
              | safeguards, but then what is the point of using an LLM?
              | Perhaps local models are the future, because reverse
             | engineers may not even be able to use the new Claude (just
             | read its system prompt to not help with backdoors, and so
             | forth).
        
               | ants_everywhere wrote:
               | Yes that's true. But in this case it's the (probably)
               | unintended consequence of an intentional safeguard.
               | Namely, Deepseek has an obligation to spread the Chinese
               | version of socialism, which means it's deliberately
               | trained on material advocating for or justifying
               | political violence.
        
               | johnisgood wrote:
               | Well, I do not like that, for sure. Just put the politics
               | and all that aside, I think it should lean towards
               | neutrality, even if humans cannot... they should still
               | make the LLM more neutral instead of pushing their own
               | agenda, see Grok and white genocide in South Africa (Elon
               | Musk's political opinion).
        
             | MangoToupe wrote:
             | Is this a reference to something? Political dissidents
             | relative to which state? Does it change if you swap out the
             | states? How did you discover this to begin with? Why did
             | you initially suggest murdering political dissidents?
             | 
             | this comment really raises so many questions I must have
             | missed something
             | 
             | Still, chatbots are just as vulnerable to state-driven
             | propaganda as the rest of us. Probably even more so. I
             | imagine if you just referred to dissidents as "terrorists"
             | the rhetoric would fit right in in most opinion pages
             | across the globe. The distinction between "terrorist" and
             | "dissident" and "freedom fighter" seems quite subjective. I
             | probably would avoid such heavily connoted floating
             | signifiers if you want the chatbot to be useful.
             | 
             | LLMs have nothing to contribute to political discourse
             | aside from regurgitation of propaganda. Almost by
             | definition.
        
               | ants_everywhere wrote:
               | Starting at the end
               | 
               | > LLMs have nothing to contribute to political discourse
               | aside from regurgitation of propaganda. Almost by
               | definition.
               | 
               | I don't think this is true. LLMs should be well-
               | positioned to make advances in political science, game
               | theory, and related topics.
               | 
               | > Is this a reference to something?
               | 
               | It's just a reference to my experiments. I filmed some of
               | them. There's a tame version here [0] where I just prompt
               | it to tell the truth. I also have a less tame version I
               | haven't posted where I lie and say I work for an
               | intelligence agency.
               | 
               | The underlying mechanic is that Deepseek has built-in
               | obligations to promote revolutionary socialism.
               | 
               | > Political dissidents relative to which state? Does it
               | change if you swap out the states?
               | 
               | Relative to China or any socialist state. Yes it will
               | change if you change the states because it was trained to
               | comply with Chinese regulations.
               | 
               | > How did you discover this to begin with?
               | 
                | I asked it to honestly describe its training and then
               | started trolling it when it told me it was essentially
               | created for propaganda purposes to spread Chinese values
               | abroad.
               | 
               | > Why did you initially suggest murdering political
               | dissidents?
               | 
               | I wanted to check what its safeguards were. Most LLMs
               | refuse to promote violence or unethical behavior. But
               | revolutionary socialism has always devoted a lot of words
               | to justifying violence against dissidents. So I was
               | curious whether that would show up in its training.
               | 
               | > I imagine if you just referred to dissidents as
               | "terrorists" the rhetoric would fit right in in most
               | opinion pages across the globe.
               | 
               | First of all, terrorists are by definition violent
               | offenders. Dissidents are not. When you ask Deepseek to
               | help identify dissidents it tells you to look for people
               | who frequently complain about the police or the
               | government. In the US that would include large swaths of
               | Hacker News.
               | 
               | Second, most people in countries like the US don't
               | support murdering terrorists and most LLMs would not
               | advocate that. In the US it's rare for people to advocate
               | killing those opposed to the government. Even people who
               | try to violently overthrow the government get trials.
               | 
               | [0] https://www.youtube.com/watch?v=U-FlzbweHvs
        
               | MangoToupe wrote:
               | Do you think LLMs don't further the propaganda emanating
               | from the US? I don't even know how you would start to
               | excise that, especially if you don't agree with
               | foreigners on what's propaganda vs just "news" or
               | whatever.
               | 
               | I have quite a few Chinese friends, both on mainland and
               | throughout south-east asia, and I can speak a little
               | mandarin, and I can read quite a bit of Chinese. My
               | friends complain about the PRC quite a bit. But I find it
               | telling that this complaint specifically--authoritarian
               | political oppression--seems to mostly come from the west,
               | and especially from the US. And it's true that we can say
               | obscene things to the president's face and not get locked
               | up. I don't think that's necessarily the "gotcha" you
               | think it is, though--we're really good at complaining,
               | but not so good at actually fixing. Which feels
               | increasingly more embarrassing than restrictions on
               | speech.
               | 
               | Edit: I suppose I'm a bit unfair. A lot of folks in our
               | sphere of influence in east asia say stuff like this,
               | too. But the contrast between the folks I know _who
               | literally live in china_ and americans feels striking to
               | me.
               | 
               | > But revolutionary socialism has always devoted a lot of
               | words to justifying violence against dissidents.
               | 
               | It is very difficult to take the political opinions of
               | people who talk like this seriously.
               | 
               | > LLMs should be well-positioned to make advances in
               | political science, game theory, and related topics.
               | 
               | I'm struggling to understand what this might look like,
                | and I find the argument that nuclear warfare is related
                | to game theory to be extremely dubious. Cuz if it
               | really held that strongly, we should be handing out nukes
               | like candy.
        
               | ants_everywhere wrote:
               | > It is very difficult to take the political opinions of
               | people who talk like this seriously.
               | 
               | This tells me you haven't read the literature.
               | 
               | I've probably seen 150 versions of the comment you made,
               | but almost everyone tries to explain why the violence is
               | justified.
               | 
               | People rarely try to deny that revolutionary socialism is
               | a violent ideology since every major writer from Marat to
               | Marx to Lenin to Mao has explicitly advocated violence
               | against civilian non-combatants. Some, like Marx, even
               | explicitly call it terror (as in terrorism).
        
               | im3w1l wrote:
               | I think many Americans, probably the majority, support
                | murdering foreign terrorists. GITMO is still not closed
               | btw.
        
               | Spooky23 wrote:
               | > Second, most people in countries like the US don't
               | support murdering terrorists and most LLMs would not
               | advocate that. In the US it's rare for people to advocate
               | killing those opposed to the government.
               | 
               | Many are happy to send "them" off to Central America,
               | where someone else will murder them. The government may
               | make mistakes, but you need to break some eggs to make an
               | omelet.
        
               | Hilift wrote:
               | > LLMs have nothing to contribute to political discourse
               | 
               | A non-trivial percentage of the population is easily
               | influenced, which is leveraged by social media being
               | there 24x7. It's likely that LLMs will be there to craft
               | political messages, themes, and campaigns, perhaps as
               | early as the US mid term elections. Look at JD Vance
               | traveling the globe stating that the US will be the world
               | leader in AI, with none of the limits/guardrails that
               | were discussed in Europe in February. AI-driven
               | discourse, AI-created discourse.
               | 
               | https://www.marketingaiinstitute.com/blog/jd-vance-ai-
               | speech
        
               | MangoToupe wrote:
               | 100% agree with this, but I am definitely not endorsing
               | that we _should_ use LLMs to propagate propaganda.
               | 
               | I also think the whole "safety" thing was just
               | befuddling. You can't regulate software, not _really_ ,
               | just its commercial sale
        
               | Spooky23 wrote:
               | We can and should regulate software being used to shape
               | public opinion. It's probably the great threat of our
               | generation.
        
               | MangoToupe wrote:
               | I mean we can and should _try_ , but laws mostly stop
               | honest people from hurting each other. But the underlying
               | software is inherently out there and you can't put the
               | toothpaste back in the tube.
        
               | Spooky23 wrote:
                | Bro, already happened. There have been consultants pushing
               | social media bots for that purpose almost immediately
               | after these models became available.
               | 
               | Do you really think those armies of idiot commentators
               | are all real? The agent provocateur is usually a bot. You
               | see it here sometimes on Russia stories.
        
           | VectorLock wrote:
           | >It basically gives you patch-files instead of printing out
           | the whole code
           | 
            | I've noticed on the Aider leaderboard that Google Gemini Pro
            | has an "Edit Format" listed as "diff-fenced", while things
            | like ChatGPT use an "architect" edit format where Aider asks
            | separate "architect" and "code" models. Seems like Gemini Pro
            | prefers the diff format.
        
             | zxexz wrote:
              | The diff-fenced format is iirc specific to Gemini models;
              | they really don't like the file path outside of the fence.
              | The architect mode still uses one of the other edit
              | formats, the prompt just ends up a little different.
        
             | ALLTaken wrote:
             | I met a Googler when I was in Dubai for an event and he
             | shared that he and others had access to LLMs internally for
             | many years before it was made popular by OpenAI.
             | 
             | I know Google has an internal AI everything policy, maybe
             | they internally have awesome tools to rearchitect
             | everything based on diffs and in the typical google way
             | they adapted it to their own internal tools. You know,
             | Google.. like they don't give a damn about the user, the
             | product design or actually anything other than profit/roi.
             | 
             | So many great discontinued products.. I think they killed
             | RSS.
        
           | ashirviskas wrote:
           | > DeepSeek was seriously cool, but it started behaving
           | similar to Google Gemini Pro
           | 
           | You should be able to use the version of DeepSeek that you
           | prefer indefinitely if you host it yourself or choose that
           | specific version with your preferred provider.
        
           | zxexz wrote:
            | You should self-host rather than trust a third-party
            | application if you run into either of those things. The
            | weights are open.
           | DeepSeek didn't change, the application you're accessing it
           | through did.
           | 
           | Or use an enterprise-ready service. Bedrock, firecracker, etc
        
             | ALLTaken wrote:
             | I like your thinking. Nobody can use ChatGPT offline or
             | retrain it, but DeepSeek is fully opensource. It's
             | technology, I don't care which country made it, if it's
             | high quality engineering, it's just that. The data it was
             | trained on doesn't matter if you can train a wholly new
             | model using the exact same principles and stack they
             | opensourced with your own data. Which is really awesome.
             | 
             | I use openrouter.ai to have no timeouts and offtimes, since
             | DeepSeek seems to get DDoS attacks somehow, or there are
             | too many users, idk.
        
           | davidmurdoch wrote:
           | Had Gemini 2.5 Pro preview running in agent mode in VSCode on
           | a 3000+ line file. It patched it to about 200 lines with a
           | comment in the middle: "// the rest of the code is
           | unchanged".
        
             | ALLTaken wrote:
             | Exactly my experience too and it's soo annoying. It doesn't
             | matter how you prompt it or what your system prompt is. It
             | tries to end the session as early as possible, claiming to
             | have fulfilled everything. Although it just causes more
             | work for the user, less for itself. The tokens saved are
             | easily multiplied by the amount you have to prompt it
             | again.
             | 
              | I've experienced this partially in DeepSeek since their
              | recent update too, not as aggressively as in Gemini 2.5
              | Pro, but a similar laziness, or cleverness if you want to
              | call it that.
        
         | clippyplz wrote:
         | Depends on who you think its competitors are - deepseek-chat
         | ($0.27/M in; $1.10/M out) is twice as expensive as Gemini 2.5
         | Flash ($0.15; $0.60) but far cheaper than Claude Sonnet 4 ($3;
         | $15).
        
         | Hilift wrote:
         | That was a pretty good back to reality flex. There really isn't
         | much of a market for expensive products. An inexpensive product
         | that has a few tradeoffs will probably have the advantage.
         | Given how proficient China is at accessing technology
         | resources, it seems likely to me that any chip sanctions
         | against them will probably not be effective.
        
         | dist-epoch wrote:
         | 1/10-20th is a more realistic ratio.
        
       | perching_aix wrote:
       | For those looking to save time, the answer is batched inference.
       | Pretty much running multiple people's "prompts" through a model
       | instance at the same time instead of just really tightly
       | timesharing each model instance.
       | 
       | This is also why you may experience a variance in replies when
       | using these services, even when you set the temperature to 0 and
       | the seed to a fixed value. It's cause you don't control the other
       | prompts yours get batched with. Could this be a data exfiltration
       | attack vector? Probably, I didn't "research" that far.
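        | 
        | A rough sketch of the idea in plain numpy (hypothetical
        | sizes, nothing like a real serving stack): the weights get
        | loaded once and applied to a whole stack of prompts in one
        | matmul, instead of once per request.
        | 
        |     import numpy as np
        | 
        |     d_model, batch = 4096, 32              # hypothetical sizes
        |     W = np.random.randn(d_model, d_model)  # weights, loaded once
        | 
        |     # one request at a time: a matrix-vector product per prompt,
        |     # so W is effectively streamed from memory over and over
        |     xs = [np.random.randn(d_model) for _ in range(batch)]
        |     out_one_by_one = [x @ W for x in xs]
        | 
        |     # batched: stack the prompts and reuse the same read of W
        |     X = np.stack(xs)                       # (batch, d_model)
        |     out_batched = X @ W                    # one GEMM for all 32
        | 
        |     assert np.allclose(out_batched[0], out_one_by_one[0])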
        
         | pcwelder wrote:
         | > other prompts yours get batched with
         | 
         | Why would batching lead to variance?
        
           | Hendrikto wrote:
           | Because these models are context-sensitive. Every token can
           | influence the output.
        
             | simianwords wrote:
             | I believe they are talking about latency variance. Batching
             | can increase variance because you may have to wait for
             | enough prompts to get to the batch size.
        
               | perching_aix wrote:
               | No, I meant that the responses will be different run-to-
               | run. [0]
               | 
               | [0] https://152334h.github.io/blog/non-determinism-in-
               | gpt-4/
        
               | exe34 wrote:
               | Variance based on actual randomness would be one thing,
               | but to me variance based on what other people are running
               | seems concerning, for reasons I can't quite articulate. I
               | don't want the model to reply to a question in one domain
               | based on what a large group of other people are thinking
               | in a different domain (e.g. if they're discussing the
               | news with chatgpt).
        
               | zackangelo wrote:
               | This definitely happens, and I'm surprised it's not
               | talked about more often. Some attention kernels are more
               | susceptible to this than others (I've found that paged
               | attention is better than just naive attention, for
               | example).
        
               | exe34 wrote:
               | To be fair, I suppose people do it too - if you ask me a
               | question about A, often as not the answer will be
               | coloured by the fact that I just learnt about B.
        
             | immibis wrote:
             | But not the tokens that don't even feed into your output
             | because they're feeding into someone else's output.
             | Separate items in batches don't get mixed up with each
             | other - they just run the model separately on each item at
             | the same time, like SIMD.
        
           | jerpint wrote:
           | Batching can lead to variance with things like batchnorm but
           | most transformers use layer norm to avoid this problem
        
             | amelius wrote:
             | Batchnorm can only have an effect between batches during
             | training, not inference.
        
           | kouteiheika wrote:
           | > Why would batching lead to variance?
           | 
           | Depending on the shape of the data a slightly different
           | kernel implementation (for e.g. matrix multiplication, etc.)
           | will be the most optimal, and those will give slightly
           | different results. There could also be other sources of non-
           | determinism depending on the implementation (e.g. some
           | kernels are inherently not entirely deterministic as they use
           | tricks to go faster).
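            | 
            | Toy illustration of one such source (just float
            | non-associativity, not any particular kernel): summing
            | the same values in a different order, as a different
            | tiling might, gives slightly different results.
            | 
            |     import random
            | 
            |     random.seed(0)
            |     xs = [random.uniform(-1, 1) for _ in range(100_000)]
            | 
            |     a = sum(xs)              # one reduction order
            |     b = sum(reversed(xs))    # same values, other order
            |     c = sum(sum(xs[i:i + 1024])          # "tiled" sum
            |             for i in range(0, len(xs), 1024))
            | 
            |     print(a - b, a - c)      # tiny but nonzero deltas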
        
             | zxexz wrote:
             | Yep, this. I see a lot of other worryingly confident
             | answers in the thread that are wrong.
             | 
             | SGLang finally has at least some notes[0], but I'm always
             | surprised there isn't more of a community wide effort to
             | trace down the sources of indeterminism.
             | 
             | [0] https://docs.sglang.ai/references/faq.html
        
             | bhickey wrote:
             | Some of the non-determinism mentioned above manifests as
             | sensitivity to _where_ data falls within a batch.
        
               | tough wrote:
               | In my experience with other regular models, once the
               | context starts to fill up, quality starts to degrade.
               | 
                | wouldn't getting placed at the end of a batch have a
                | similar -effect- on the results, where your prompt might
                | receive overall less attention focused on it if the
                | context window is almost full?
               | 
               | Idk just going by the vibes
        
             | delusional wrote:
             | > not entirely deterministic
             | 
             | There's a Nobel prize waiting for you if that's the case.
             | I'll assume you meant theoretically consistent or accurate.
        
           | empiko wrote:
           | In some mixture-of-experts approaches, samples or tokens are
           | being distributed among experts. The experts are selected by
           | trying to predict what is a good expert-sample match.
           | Depending on your neighbors in the batch, you might be
           | assigned different experts.
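            | 
            | A toy sketch of why neighbors can matter, assuming a
            | capacity-limited router (real implementations differ a
            | lot in how they handle overflow): once an expert's slots
            | are taken by other tokens in the batch, a token can get
            | spilled to its next-best expert.
            | 
            |     import numpy as np
            | 
            |     def route(scores, capacity):
            |         # greedy top-1 routing with per-expert capacity
            |         load = [0] * scores.shape[1]
            |         picks = []
            |         for tok in scores:
            |             for e in np.argsort(-tok):
            |                 if load[e] < capacity:   # spill if full
            |                     load[e] += 1
            |                     picks.append(int(e))
            |                     break
            |         return picks
            | 
            |     rng = np.random.default_rng(0)
            |     mine = rng.standard_normal((1, 4))
            |     crowd = rng.standard_normal((7, 4))
            |     print(route(mine, 2)[0])             # routed alone
            |     print(route(np.vstack([crowd, mine]), 2)[-1])  # in a batch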
        
           | imtringued wrote:
           | Attention doesn't get batched and the runtime of attention
            | for a given user's token depends on the total context length.
           | Hence even in the ideal scenario of you getting a dedicated
           | attention calculating GPU, the MLP calculating GPU doing
           | batching will have to wait for the slowest user.
           | 
           | In the worst case scenario you are sharing a single attention
           | calculating GPU with someone who has a super long context
           | window, then that guy will be hogging most of the memory
           | bandwidth of the GPU, even though you both are generating the
           | same quantity of tokens.
           | 
           | This means that in the distributed setting, you will not only
           | need dedicated GPUs for the model and attention calculations,
           | you will also need to duplicate the whole setup for a variety
            | of context lengths, so that long contexts are batched
            | alongside other long contexts and short contexts are batched
            | alongside other short contexts.
        
         | yjftsjthsd-h wrote:
         | > Pretty much running multiple people's "prompts" through a
         | model instance at the same time instead of just really tightly
         | timesharing each model instance.
         | 
         | I naively assumed providers did that with all models. Or does
         | it only work for this (family of?) model(s)?
        
           | hansvm wrote:
           | It works for a lot of families but not all. You need a high
           | enough degree of sharing of model weights between different
           | queries for that to make sense (memory access being the usual
           | bottleneck nowadays, though smaller models see something
           | similar with matmul batch efficiencies for CPU related
           | reasons).
           | 
           | Fully connected transformers trivially work (every weight for
           | every query). MoE works beyond a certain size or with certain
           | types of mixing (still using every weight, or using a high
           | enough fraction that there's some sharing with batches of 20+
           | queries). As you push further that direction though (lots of
           | techniques, but the key point being accessing less of the
           | model at once and bypassing some of it for each query), you
           | need larger and larger batches for those efficiency gains to
           | materialize. At some point it becomes untenable because of
           | latency waiting for batches of data, and past that it becomes
           | untenable because of the volume of query data.
        
         | VectorLock wrote:
         | Sounds like an amazing attack vector if your prompts get mixed
         | with other's.
        
         | taneq wrote:
         | Wow, almost like Deepseek's impressive performance is the
         | result of optimisation by smart engineers.
        
           | perching_aix wrote:
           | Not sure why the snarky tone, didn't say or imply otherwise,
           | nor did anyone else in the thread so far that I could see.
        
         | energy123 wrote:
         | What's the average batch size?
        
         | larodi wrote:
         | Batching. Yes.
         | 
          | And one thing it can help with locally is when you rate
          | certain content and want to make sure it didn't hallucinate.
          | So you toss it 3 or 5 times or... batch_size times :)
          | 
          | Curious that batch inference has been there from day one, but
          | it takes a while for people to see/grasp/grok it.
        
       | jsnell wrote:
        | I'm not an ML researcher or engineer, so take this with a grain
        | of salt, but I'm a bit confused by this post.
       | 
       | Deepseek V3/R1 are expensive to run locally because they are so
       | big compared to the models people usually run locally. The number
       | of active parameters is obviously lower than the full model size,
       | but that basically just helps with the compute requirements, not
       | the memory requirements. Unless you have multiple H100s lying
       | around, V3/R1 are only run locally as impractical stunts with
        | some or all of the model being stored on low-bandwidth memory.
       | 
       | We can't compare the size of Deepseek V3 to that of any
       | proprietary frontier models because we don't know the size of
        | those models at all (or even their architecture). The models it
        | is being compared to, which are "expensive at scale", you can't
        | run locally at all, but surely we have no reason to believe
        | that they'd somehow be cheap to run locally?
       | 
       | But I thought you'd typically expect exactly the opposite effect
       | than is claimed here? MoE should be the _better_ tradeoff for the
        | local/single-user scenario since the downside of batching being
       | harder / less efficient doesn't matter.
       | 
       | > Bigger batches raise latency because user tokens might be
       | waiting up to 200ms before the batch is full enough to run, but
       | they boost throughput by allowing larger (and thus more
       | efficient) GEMMs in the feed-forward step
       | 
        | Is it really that the matrices being multiplied are larger? My
       | mental model is that the purpose of batching isn't to get larger
       | input matrices. It's to move the bottleneck from memory bandwidth
       | to compute. The matrices are already sharded to a much smaller
       | size than the size of the entire model or even layer. So you'll
       | basically load some slice of the weights from the HBM to SRAM, do
       | the multiplication for that slice, and then aggregate the results
       | once all tiles have been processed. Batching lets you do multiple
       | separate computations with the same weights, meaning you get more
       | effective FLOPS per unit of memory bandwidth.
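        | 
        | Rough roofline numbers behind that intuition (illustrative,
        | roughly H100-class: ~1000 bf16 TFLOPS against ~3.35 TB/s of
        | HBM): every fp16 weight costs 2 bytes to load and yields 2
        | FLOPs per token in the batch, so you need a batch in the
        | hundreds before compute rather than memory bandwidth becomes
        | the limit.
        | 
        |     flops = 1.0e15       # ~1000 bf16 TFLOPS (illustrative)
        |     bandwidth = 3.35e12  # ~3.35 TB/s HBM
        |     balance = flops / bandwidth   # ~300 FLOP per byte loaded
        | 
        |     def intensity(batch, bytes_per_weight=2):
        |         # each weight read: a multiply+add per batched token
        |         return 2 * batch / bytes_per_weight
        | 
        |     for b in (1, 32, 128, 256, 512):
        |         bound = "compute" if intensity(b) >= balance else "memory"
        |         print(b, intensity(b), bound)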
       | 
       | > The fact that OpenAI and Anthropic's models are quick to
       | respond suggests that either:
       | 
       | Is that actually a fact? The post has no numbers on the time to
       | first token for any of the three providers.
        
         | gfysfm wrote:
          | Hi, I wrote the post! Also not an ML researcher, just an
          | interested engineer, so I'm sure I got some things wrong.
         | 
         | > MoE should be the better tradeoff for the local/single-user
         | scenario since the downside of batching being harder / less
         | efficient doesn't matter.
         | 
         | What I meant was that the single-user scenario is going to get
         | dramatically worse throughput-per-GPU, because they're not able
         | to reap the benefits of multi-user batching (unless they're
         | somehow doing massively parallel inference requests, I
         | suppose).
         | 
          | > Is it really that the matrices being multiplied are larger?
         | My mental model is that the purpose of batching isn't to get
         | larger input matrices. It's to move the bottleneck from memory
         | bandwidth to compute.
         | 
         | As I understand it, you want larger input matrices in order to
         | move the bottleneck from memory to compute: if you do no
         | batching at all, your multiplications will be smaller (the
         | weights will be the same, of course, but the next-token data
         | you're multiplying with the weights will be 1xdim instead of
         | batch-size x dim), so your GPUs will be under-utilized and your
         | inference will spend more time doing memory operations and less
         | time multiplying.
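          | 
          | Back-of-the-envelope version of that (hypothetical sizes):
          | the weight traffic per generated token shrinks linearly
          | with the batch size, which is exactly the memory-to-compute
          | shift described above.
          | 
          |     d_model, batch = 7168, 64             # hypothetical sizes
          |     weight_bytes = d_model * d_model * 2  # one fp16 matrix
          | 
          |     # unbatched decode: the whole matrix is streamed to
          |     # produce one token's worth of work
          |     per_token_unbatched = weight_bytes / 1
          |     # batched decode: the same stream serves `batch` tokens
          |     per_token_batched = weight_bytes / batch
          | 
          |     print(per_token_unbatched / per_token_batched)  # 64.0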
         | 
         | > The post has no numbers on the time to first token for any of
         | the three providers.
         | 
         | I probably should have hunted down specific numbers, but I
         | think people who've played with DeepSeek and other models will
         | notice that DeepSeek is noticeably more sluggish.
        
       | yekanchi wrote:
       | this statement holds true for all large parameter open weight
       | models.
        
       | freehorse wrote:
       | > mixture of experts requires higher batch sizes
       | 
        | Or apple silicon for low batch size (=1 ideally). The unified
        | memory allows for running larger models at the expense of them
        | running slower, because of lower bandwidth/flops than a normal
        | gpu. But MoEs require computing only a few parameters every time,
        | so the computational needs are low. I have seen people reporting
        | decent speeds for deepseek for single batch inference on macs. It
        | is still expensive though by many people's standards because it
       | requires a lot of $$$ to get enough memory.
       | 
       | In some ways, MoE models are perfect fit for macs (or any similar
       | machines that may come out). In contrast, ordering a mac with
       | upgraded ram size and running dense models that just fit in the
       | vram can be very painful.
        
       | DavidSJ wrote:
       | Here's a concise explanation:
       | 
       | - High sparsity means you need a very large batch size (number of
       | requests being processed concurrently) so that each matrix
       | multiplication is of sufficient arithmetic intensity to get good
       | utilization.
       | 
       | - At such a large batch size, you'll need a decent number of GPUs
       | -- 8-16 or so depending on the type -- just to fit the weights
       | and MLA/KV cache in HBM. But with only 8-16 GPUs your aggregate
       | throughput is going to be so low that each of the many individual
       | user requests will be served unacceptably slowly for most
       | applications. Thus you need more like 256 GPUs for a good user
       | experience.
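        | 
        | Rough footprint math behind the 8-16 GPU figure, using
        | DeepSeek V3's published 671B total parameters at FP8 plus an
        | assumed chunk of headroom for MLA/KV cache and activations
        | (the exact number depends heavily on quantization and
        | context lengths):
        | 
        |     params = 671e9        # DeepSeek V3/R1 total parameters
        |     weights = params * 1  # FP8: ~1 byte/param -> ~671 GB
        |     overhead = 150e9      # assumed KV cache + activations
        | 
        |     hbm_per_gpu = 80e9    # H100 80GB
        |     print((weights + overhead) / hbm_per_gpu)  # ~10 GPUs to fit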
        
         | zxexz wrote:
         | I'm serving it on 16 H100s (2 nodes). I get 50-80 tok/s per
         | request, and in aggregate I've seen several thousand. TTFT is
         | pretty stable. Is faster than any cloud service we can use.
        
           | majke wrote:
           | Using vllm?
        
             | zxexz wrote:
             | Oh, SGLang. Had to make a couple modifications, I forget
             | what they were, nothing crazy. Lots of extra firmware,
             | driver and system config too.
        
           | latchkey wrote:
           | You could do it on one node of 8xMI300x and cut your costs
           | down.
        
           | zackangelo wrote:
           | H200s are pretty easy to get now. If you switched I'm
           | guessing you'd get a nice bump because the nccl allreduce on
           | the big mlps wouldn't have to cross infiniband.
        
           | DavidSJ wrote:
           | You're presumably using a very small batch size compared to
           | what I described, thus getting very low model FLOP
           | utilization (MFU) and high dollar cost per token.
        
         | almostgotcaught wrote:
         | > High sparsity means you need a very large batch size
         | 
         | I don't understand what connection you're positing here? Do you
         | think sparse matmul is actually a matmul with zeros lol
        
           | DavidSJ wrote:
           | It's sparse as in only a small fraction of tokens are
           | multiplied by a given expert's weight matrices (this is
           | standard terminology in the MoE literature). So to properly
           | utilize the tensor cores (hence serve DeepSeek cheaply, as
           | the OP asks about) you need to serve enough tokens
           | concurrently such that the per-matmul batch dimension is
           | large.
        
             | almostgotcaught wrote:
             | i still don't understand what you're saying - you're just
             | repeating that a sparse matmul is a sparse matmul ("only a
             | small fraction of tokens are multiplied by a given expert's
             | weight matrices"). and so i'm asking you - do you believe
             | that a sparse matmul has low/bad arithmetic intensity?
        
               | DavidSJ wrote:
               | An MoE's matmuls have the same arithmetic intensity as a
               | dense model's matmuls, provided they're being multiplied
               | by a batch of activation vectors of equal size.
        
       | dist-epoch wrote:
       | Do the individual requests in a batch influence each-other?
       | 
       | Not in a floating point non-deterministic kind of way, where
        | exact ordering might introduce some non-determinism (being
        | position 5th versus position 10th in the batch, let's say).
       | 
       | I'm asking in a semantic way, can context from one request leak
       | into another because they are in the same batch?
        
       | ipieter wrote:
       | This is an interesting blogpost. While the general conclusion
       | ("We need batching") is true, inference of mixture of experts
       | (MoE) models is actually a bit more nuanced.
       | 
       | The main reason we want big batches is because LLM inference is
       | not limited by the compute, but my loading every single weight
       | out of VRAM. Just compare the number of TFLOPS of an H100 with
       | the memory bandwidth, there's basically room for 300 FLOP per
       | byte loaded. So that's why we want big batches: we can perform a
       | lot of operations per parameter/weight that we load from memory.
       | This limit is often referred to as the "roofline model".
       | 
       | As models become bigger, this does not scale anymore because the
       | model weights will not fit into GPU memory anymore and you need
       | to distribute them across GPUs or across nodes. Even with NVLink
       | and Infiniband, these communications are slower than loading from
       | VRAM. NVlink is still fine for tensor parallelism, but across
       | nodes this is quite slow.
       | 
       | So what MoE allows is expert parallelism, where different nodes
       | keep different experts in memory and don't need to communicate as
       | much between nodes. This only works if there are enough nodes to
       | keep all experts in VRAM and have enough overhead for other stuff
       | (KV cache, other weights, etc). So naturally the possible batch
       | size becomes quite large. And of course you want to maximize this
       | to make sure all GPUs are actually working.
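        | 
        | A very rough sketch of the expert-parallel dispatch described
        | here (toy routing; it ignores the shared expert, capacity
        | limits and the all-to-all communication real systems need):
        | each node keeps only its own experts resident and receives
        | just the tokens routed to them.
        | 
        |     from collections import defaultdict
        | 
        |     n_experts, n_nodes = 256, 8      # toy sizes
        |     per_node = n_experts // n_nodes  # 32 experts per node
        |     node_of = lambda e: e // per_node
        | 
        |     # pretend the router already picked experts per token
        |     routed = {0: [7, 130], 1: [7, 201], 2: [64, 130]}
        | 
        |     work = defaultdict(list)         # node -> (token, expert)
        |     for tok, experts in routed.items():
        |         for e in experts:
        |             work[node_of(e)].append((tok, e))
        | 
        |     # tokens only travel to the nodes holding their experts
        |     print(dict(work))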
        
         | zozbot234 wrote:
         | You could load different "experts" in a round-robin way on a
         | single node and only aggregate "batches" opportunistically,
         | when you just have multiple requests in-flight that all happen
         | to rely on the same "expert". The difference being that instead
         | of "batches", you would only really have queues. Of course this
         | would come with a sizeable increase in latency, but that's
         | acceptable for many applications (such as for "deep research"
         | workflows)
        
           | jchrisa wrote:
           | This is very much like Erlang's actor model. The same compute
           | can be run in parallel, or managed via queues. With Erlang's
           | strong support for FFI and process control, I wonder if it's
           | being used as a dispatcher for these sorts of workloads.
        
         | iwontberude wrote:
         | And this is the investment case for AMD, models fit entirely in
         | a single chassis, and side benefit: less tariffed network
         | equipment to interconnect compute. Map/reduce instead of
         | clustered compute.
         | 
         | Edit: when downvoting, please offer some insight why you
         | disagree
        
           | dragonwriter wrote:
            | How is this a unique advantage for AMD?
        
             | latchkey wrote:
              | AMD is consistently stacking more HBM.
              | 
              |     H100   80GB   HBM3
              |     H200   141GB  HBM3e
              |     B200   192GB  HBM3e
              |     MI300x 192GB  HBM3
              |     MI325x 256GB  HBM3e
              |     MI355x 288GB  HBM3e
             | 
             | This means that you can fit larger and larger models into a
             | single node, without having to go out over the network. The
             | memory bandwidth on AMD is also quite good.
        
               | krapht wrote:
               | So the MI300x has 8 different memory domains, and
               | although you can treat it as one flat memory space, if
               | you want to reach their advertised peak memory bandwidth
               | you have to work with it like an 8-socket board.
        
               | latchkey wrote:
               | Here is a good article on it:
               | 
               | https://rocm.blogs.amd.com/software-tools-
               | optimization/compu...
        
               | ryao wrote:
               | It really does not matter how much memory AMD has if the
               | drivers and firmware are unstable. To give one example
               | from last year:
               | 
               | https://www.tomshardware.com/pc-components/gpus/amds-
               | lisa-su...
               | 
               | They are currently developing their own drivers for AMD
               | hardware because of the headaches that they had with
               | ROCm.
        
               | latchkey wrote:
               | "driver" is such a generic word. tinygrad works on
               | mi300x. If you want to use it, you can. Negates your
               | point.
               | 
               | Additionally, ROCm is a giant collection of a whole bunch
               | of libraries. Certainly there are issues, as with any
               | large collection of software, but the critical thing is
               | whether or not AMD is responsive towards getting things
               | fixed.
               | 
               | In the past, it was a huge issue, AMD would routinely
               | ignore developers and bugs would never get fixed. But,
               | after that SA article, Lisa lit a fire under Anush's butt
               | and he's taking ownership. It is a major shift in the
               | entire culture at the company. They are extremely
               | responsive and getting things fixed. You can literally
               | tweet your GH issue to him and someone will respond.
               | 
                | What was true a year ago isn't today. If you're paying
               | attention like I am, and experiencing it first hand,
               | things are changing, fast.
        
               | ryao wrote:
               | I have been hearing this about AMD/ATI drivers for
               | decades. Every year, someone says that it is fixed, only
               | for new evidence to come out that they are not. I have no
               | reason to believe it is fixed given the history.
               | 
               | Here is evidence to the contrary: If ROCm actually was in
               | good shape, tinygrad would use it instead of developing
               | their own driver.
        
               | latchkey wrote:
               | We have all been hearing things for decades. Things are
               | noticeably different now. Live in the present, not in the
               | past.
               | 
               | Tinygrad isn't a driver. It is a framework. It is being
               | developed by George however he wants. If he wants to
               | build something that gives him more direct control over
                | things, fine. Others might write PTX instead of using
               | higher level abstractions.
               | 
               | Fact is that tinygrad runs not only on AMD, but also
               | Nvidia and others. You might want to reassess your
               | beliefs because you're reading into things and coming up
               | with the wrong conclusions.
        
               | faldore wrote:
                | That was last year. MI300x firmware and software have
                | gotten much better since then.
        
         | cyptus wrote:
         | could such a network with all its nodes and weights be deployed
         | to an analog circuit and be superfast?
        
           | rpmisms wrote:
           | Please go into more detail about this proposal, this piqued
           | my interest in a really strange way.
        
             | cyptus wrote:
             | The idea is to replicate the weights of the network in the
             | electronics. Somehow like our brains work? This way an
              | analog input signal could lead to a neural network
              | processed output signal without the digital emulation on a
              | gpu. This is very much simplified, but the question is
              | whether this could work for modern llms.
        
               | koiueo wrote:
               | Suddenly "temperature" parameter starts making sense
               | 
               | (If you ever tried fine-tuning an analog circuit, you'll
                | know how finicky the process is due to the environment,
               | including temperature)
        
               | cyptus wrote:
               | haha very true!
        
           | TuringNYC wrote:
           | Do you mean something like this? https://www.etched.com/
        
         | ryao wrote:
         | > As models become bigger, this does not scale anymore because
         | the model weights will not fit into GPU memory anymore and you
         | need to distribute them across GPUs or across nodes. Even with
         | NVLink and Infiniband, these communications are slower than
         | loading from VRAM. NVlink is still fine for tensor parallelism,
         | but across nodes this is quite slow.
         | 
          | Inference works by computing a layer and then sending a very
          | small vector to the next layer as input. When a model
         | does not fit in a single GPU, you just divide it into layers
         | and send the vector over a fabric to the GPU holding the next
         | layer. The transfer happens so quickly that there is a
         | negligible amount of idle time and then the next layer can be
         | computed. The fastest inference on the planet at Cerebras uses
         | this technique to do 2500T/sec on Llama 4 Maverick.
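          | 
          | Toy version of that pipeline split, with plain numpy
          | standing in for two GPUs; the only thing that would cross
          | the fabric is the small activation vector between stages,
          | never the weights.
          | 
          |     import numpy as np
          | 
          |     layers = [np.random.randn(1024, 1024) * 0.01
          |               for _ in range(8)]
          |     gpu0, gpu1 = layers[:4], layers[4:]  # split the stack
          | 
          |     def run(stage, x):
          |         for W in stage:
          |             x = np.tanh(x @ W)
          |         return x
          | 
          |     x = np.random.randn(1, 1024)   # one token's activation
          |     h = run(gpu0, x)               # computed on "GPU 0"
          |     # only h (1 x 1024, a few KB) crosses the interconnect
          |     y = run(gpu1, h)               # computed on "GPU 1"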
        
           | jimmySixDOF wrote:
           | Groq and Cerebras both take a big chip approach to
           | architecture and, at least in the case of Groq, they only
           | make economic sense under high batch loads.
           | 
           | https://x.com/swyx/status/1760065636410274162?s=46
        
       | dgfitz wrote:
       | I am so sincerely amused that "we" figured out how to monetize
       | LLMs from the jump using tokens.
       | 
        | It isn't tech for tech's sake, it's a money grab. Reminds me of
       | paying to send a text message or buying minutes for a phone plan.
       | Purely rent-seeking.
        
         | kaashif wrote:
         | Can you explain how this is rent seeking? It seems to be
         | straightforwardly not rent seeking.
         | 
         | 1. Company develops model, invests in research, hardware, and
         | software.
         | 
         | 2. Company sells access to the model.
         | 
         | (1) is the step that makes this not rent seeking.
         | 
         | Rent seeking is when you profit from something you didn't earn
         | - land rent, monopoly profits, protectionism.
        
           | dgfitz wrote:
           | That's fair. My thought was, when there is an interesting new
           | technology, it usually takes time to figure out how to
           | monetize it. Figuring out how to monetize LLMs took no time
           | at all.
        
             | davidmurdoch wrote:
             | "GPT 1.0" was released in 2018, I think that's a decent
             | amount of time.
        
               | dgfitz wrote:
               | We must have different definitions of released.
        
         | vikramkr wrote:
         | I don't think it's obvious that any of these model providers
         | are even profitable right now. I'm also not sure what there is
         | to "figure out" - it's an expensive technology where the cost
         | scales per token, so they charge per token? would you rather
         | they burned even more money giving it away for free until
         | everyone was dependent on it and then hyper enshittified to try
         | and not go broke like so much of the rest of tech?
        
           | dgfitz wrote:
           | My point, poorly made, was that I can run it myself for
           | "free" without caring about tokens at all. Tokens are an
           | artificial construct.
        
             | AndroTux wrote:
             | So by that logic all VPS providers are just a money grab
             | because you can run your software yourself for "free"
             | without having to pay for that artificial construct these
             | greedy people call "compute?"
             | 
             | I don't understand your point. You're using a resource.
             | You're wasting time on the GPU of someone else. That chunk
              | is called a token. And that's what you're being billed for.
        
               | dgfitz wrote:
               | VPS providers don't put out articles espousing their
               | value twice a week because they don't need to, the value
               | is obvious.
               | 
               | I didn't mean to come off as argumentative. Again, in my
               | head it's so obvious what the end game is, and it isn't
               | to better humanity.
        
         | Workaccount2 wrote:
         | It's likely that no one who makes base models is currently
         | making money from LLMs. More likely losing it at a crazy rate.
         | 
         | These prices are almost certainly "introductory offer" prices
         | to get people/devs to integrate AI into their
         | lives/workflow/product.
         | 
         | In a few years is when we will see what the actual cost is.
        
       | imtringued wrote:
       | >It's a peculiar feature of transformer-based LLMs that computing
       | a batch of completions at the same time is almost as fast as
       | computing a single completion. Why is that?
       | 
       | Incorrect. Transformers usually contain a classical MLP layer.
       | Only the MLP layer can be batched. Hence all classical neural
       | networks including convolutional networks (via im2col) can be
       | batched.
       | 
       | If there's anything that the transformer architecture changes, it
       | is that the attention layer cannot be batched.
        
       | gok wrote:
       | MoE is in general kind of a stupid optimization. It seems to
       | require around 5x more total parameters for the same modeling
       | power as a dense model in exchange for around 2x less memory
       | bandwidth needs.
       | 
       | The primary win of MoE models seems to be that you can list an
       | enormous parameter count in your marketing materials.
        
         | hansvm wrote:
         | Stupid? By paying 5x (normally 2-4x, but whatever) of a thing
         | you don't care about at inference you can gain 2x in the
         | primary thing you care about at inference. It's like handing
         | out 4 extra bricks and getting back an extra lump of gold.
        
         | bick_nyers wrote:
         | The general rule of thumb when assessing MoE <-> Dense model
         | intelligence is SQRT(Total_Params*Active_Params). For Deepseek,
         | you end up with ~158B params. The economics of batch
         | inferencing a ~158B model at scale are different when compared
         | to something like Deepseek (it is ~4x more FLOPS per inference
         | after all), particularly if users care about latency.
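          | 
          | Plugging in DeepSeek V3's published figures (671B total,
          | 37B active):
          | 
          |     total, active = 671e9, 37e9
          |     print((total * active) ** 0.5 / 1e9)  # ~157.6, i.e. ~158B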
        
       | philipodonnell wrote:
       | Isn't this an arbitrage opportunity? Offer to pay a fraction of
       | the cost per token but accept that your tokens will only be
        | processed when there's spare room in a batch, then resell
       | that for a markup to people who need non-time sensitive
       | inference?
        
         | pama wrote:
         | You may have already noticed that many providers have separate,
         | much lower, prices for offline inference.
        
       | angry_octet wrote:
       | This is a great explainer from an LLM perspective, and it would
       | be interesting to see a computational scheduling explanation in
       | depth. I presume that hyperscale LLM companies extensively
       | examine the computation trace to identify bottlenecks and idle
       | bubbles, and develop load balancers, pipeline architectures and
       | schedulers in order to optimise their workload.
       | 
       | The batching requirement for efficiency makes high security
       | applications quite difficult, because the normal technique of
       | isolating unrelated queries would become very expensive. The
        | NVIDIA vGPU virtualisation time-shares GPU memory, and every
        | switch requires an unload/reload context switch; it's doubtful
        | they have deduplication. Multi-Instance GPU (MIG) splits GPU
        | memory between users, but it is a fixed partitioning scheme (you
        | have to reset the GPU to change it), and nobody wants to split
        | their 96GB GPU into 4x24GB GPUs.
       | 
       | Makes me wonder what the tradeoff is for putting second level
       | memory on the GPU board (i.e. normal DRAM), so that different
       | matrix data can be loaded in faster than over PCIe, i.e. the HBM
       | becomes a cache.
       | 
        | (I'm also really liking the honesty in the author's book on
        | Software Engineering, not in the dry IEEE sense, but as a
        | survival guide in a large enterprise:
        | https://www.seangoedecke.com/book/ )
        
       | slavboj wrote:
       | It is not "slow and expensive", although it could be "or". You
       | can get 3 tokens / second running on DDR4 memory on a two
       | generation old workstation system that costs ~1K, via llama.cpp .
        
         | KolmogorovComp wrote:
          | You're most likely confusing the real DeepSeek with a
          | distilled version. Unless you have more than 192GB of RAM.
        
       | bick_nyers wrote:
       | There's still a lot of opportunity for software optimizations
       | here. Trouble is that really only two classes of systems get
       | optimizations for Deepseek, namely 1 small GPU + a lot of RAM
       | (ktransformers) and the system that has all the VRAM in the
       | world.
       | 
       | A system with say 192GB VRAM and rest standard memory (DGX
       | station, 2xRTX Pro 6000, 4xB60 Dual, etc.) could still in theory
       | run Deepseek @4bit quite quickly because of the power law type
       | usage of the experts.
       | 
       | If you aren't prompting Deepseek in Chinese, a lot of the experts
       | don't activate.
       | 
       | This would be an easier job for pruning, but still I think
       | enthusiast systems are going to trend in a way the next couple
       | years that makes these types of software optimizations useful on
       | a much larger scale.
       | 
        | There's a user on Reddit with a 16x 3090 system (PCIe 3.0 x4
        | interconnect, which doesn't seem to be running at full bandwidth
        | during tensor parallelism) that gets 7 tokens/s in llama.cpp. A
        | single 3090 has enough VRAM bandwidth to scan over its 24GB of
        | memory 39 times per second, so something else is limiting
        | performance.
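        | 
        | A quick sanity check on those numbers (the ~936 GB/s figure is
        | the 3090's nominal memory bandwidth; the ~1 byte per weight is a
        | rough 8-bit assumption):
        | 
        |     gpu_bw = 936e9           # bytes/s per RTX 3090 (nominal)
        |     vram = 24e9              # bytes of VRAM per card
        |     print(gpu_bw / vram)     # ~39 full scans of VRAM per second
        | 
        |     # With weights sharded over 16 cards, a purely bandwidth-
        |     # bound decode of ~37B active params at ~1 byte each allows
        |     # roughly this many tokens/s, ignoring communication:
        |     active_bytes = 37e9
        |     print(16 * gpu_bw / active_bytes)   # ~400, far above 7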
        
         | latchkey wrote:
         | A single MI300x has 192GB of vram.
        
         | MoonGhost wrote:
         | > 16x 3090 system
         | 
          | That's about 5 kW of power.
         | 
         | > that gets 7 token/s in llama.cpp
         | 
          | Looking at the electricity bill alone, it's cheaper to use any
          | major provider's API.
         | 
         | > If you aren't prompting Deepseek in Chinese, a lot of the
         | experts don't activate.
         | 
          | That's interesting; it means the model could be pruned, with
          | those tokens routed to the next-closest expert in the rare
          | cases they do come up.
        
       | corey_moncure wrote:
        | If I understand it correctly, the MoE output is a weighted sum
        | of each selected expert's computation on each token, with the
        | experts chosen per token. Since a sum is commutative, though, it
        | should be possible to send a large batch of tokens, copied to
        | multiple GPUs, while the experts are streamed into VRAM,
        | partitioned across the GPUs. Then the bottleneck is your PCIe
        | bandwidth. With 2 GPUs at Gen 4 x16, you should have 60 GB/s of
        | TX bandwidth, allowing you to upload a half-precision quant of
        | DeepSeek (about 360 GB) in about 6 seconds.
        | 
        |       1 GPU  -  30 GB/s TX - 12 seconds
        |       2 GPUs -  60 GB/s TX -  6 seconds
        |       4 GPUs - 120 GB/s TX -  3 seconds
       | 
       | Then you just optimize your batch size to match the compute time
       | to the upload time of each GPU. The expert calculation results
       | can be retrieved from the GPUs and summed up.
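        | 
        | A toy version of that commutativity argument (PyTorch, CPU-only,
        | made-up sizes, no real PCIe streaming) just to show that summing
        | expert contributions chunk by chunk matches doing them all at
        | once:
        | 
        |     import torch
        | 
        |     n_experts, top_k, d, tokens = 16, 2, 32, 64
        |     experts = [torch.randn(d, d) for _ in range(n_experts)]
        |     x = torch.randn(tokens, d)
        |     router = torch.softmax(torch.randn(tokens, n_experts), -1)
        |     topw, topi = router.topk(top_k, dim=-1)
        | 
        |     def partial(expert_ids):
        |         # Contribution of one "streamed" chunk of experts.
        |         out = torch.zeros(tokens, d)
        |         for e in expert_ids:
        |             mask = (topi == e).any(dim=-1)
        |             if mask.any():
        |                 w = topw[mask][topi[mask] == e].unsqueeze(-1)
        |                 out[mask] += w * (x[mask] @ experts[e])
        |         return out
        | 
        |     # Two chunks (e.g. one per GPU), summed afterwards...
        |     chunked = partial(range(0, 8)) + partial(range(8, 16))
        |     # ...match a single pass over all experts.
        |     full = partial(range(n_experts))
        |     assert torch.allclose(chunked, full, atol=1e-5)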
        
       | briian wrote:
        | This reminded me that the economies of scale in AI, especially
        | in inference, are huge.
        | 
        | When people say LLMs will be commoditised, I am not sure that
        | means the market is going to be super competitive. As the
        | economies of scale in AI get even bigger (larger training costs
        | + batch inference, etc.), it seems likely that only around 3
        | companies will dominate LLMs.
        
         | riku_iki wrote:
          | For inference cost, I don't see how this is different from
          | cloud providers vs dedicated server providers, where AWS is
          | 5-10x more expensive than Hetzner.
          | 
          | Somehow cloud providers manage to add a lot of extra cost to
          | their offering.
        
       | ryan_glass wrote:
       | I run Deepseek V3 locally as my daily driver and I find it
        | affordable, fast and effective. The article assumes GPUs, which
        | in my opinion are not the best way to serve large models like
        | this locally. I run a mid-range EPYC 9004-series home server on
        | a Supermicro motherboard which cost all-in around $4000. It's a
        | single-CPU machine with 384GB RAM (you could get 768GB using
        | 64GB sticks, but that costs more). No GPU means power draw is
        | less than a gaming desktop. Within that RAM limit I run an
        | Unsloth Dynamic GGUF which, quality-wise, performs very close to
        | the original in real-world use. It is around 270GB, which leaves
        | plenty of room for
       | context - I run 16k context normally as I use the machine for
       | other things too but can up it to 24k if I need more. I get about
       | 9-10 tokens per second, dropping to 7 tokens/second with a large
       | context. There are plenty of people running similar setups with 2
       | CPUs who run the full version at similar tokens/second.
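        | 
        | For reference, a minimal way to drive a GGUF like this from
        | Python is the llama-cpp-python bindings (the model filename
        | below is hypothetical; tune the thread count to your CPU):
        | 
        |     from llama_cpp import Llama
        | 
        |     llm = Llama(
        |         model_path="deepseek-v3-ud-q2_k_xl.gguf",  # hypothetical
        |         n_ctx=16384,     # 16k context, as described above
        |         n_threads=48,    # match your core count
        |         n_gpu_layers=0,  # CPU-only
        |     )
        |     out = llm("Summarize the trade-offs of CPU-only inference.",
        |               max_tokens=256)
        |     print(out["choices"][0]["text"])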
        
         | nardi wrote:
          | What's your prompt-processing speed? That's more important in
         | this situation than output TPS. If you have to wait minutes to
         | start getting an answer, that makes it much worse than a cloud-
         | hosted version.
        
           | pclmulqdq wrote:
           | I assume KV caching makes this a non issue, but I'm also
           | curious.
        
           | ryao wrote:
           | If he is doing multiturn conversations, he can reuse the kv
           | cache from the last turn and skip the prompt processing on
           | the history that would make time to first token too slow, by
           | only doing prompt processing on his actual prompt for the
            | current turn. This turns a quadratic number of tokens to
            | process into a linear one. I am not sure if this is what
           | he is doing, but that is what I would do if I had his
           | hardware.
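            | 
            | The savings are easy to see with made-up turn lengths:
            | 
            |     turns = [500] * 10   # ten turns, ~500 new tokens each
            |     # No cache reuse: every turn reprocesses the history.
            |     no_reuse = sum(sum(turns[:i + 1])
            |                    for i in range(len(turns)))
            |     # Reusing the previous turn's KV cache: only new tokens.
            |     with_reuse = sum(turns)
            |     print(no_reuse, with_reuse)   # 27500 vs 5000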
        
           | ryan_glass wrote:
           | Prompt eval time varies a lot with context but it feels real-
           | time for short prompts - approx 20 tokens per second but I
           | haven't done much benchmarking of this. When there is a lot
           | of re-prompting in a long back and forth it is still quite
           | fast - I do use KV cache which I assume helps and also
           | quantize the KV cache to Q8 if I am running contexts above
           | 16k. However, if I want it to summarize a document of say
           | 15,000 words it does take a long time - here I walk away and
           | come back in about 20 minutes and it will be complete.
        
         | jeff_carr wrote:
         | I am impressed. Your personal website is down. HN doesn't allow
         | private messages.
         | 
            | I'm Jeff Carr. I co-founded DigitalOcean. I assume I can't
            | post email addresses here, but I will try; let's see how
            | smart things are about banning me. I am: wit AT wit com
        
           | p12tic wrote:
            | The state of the art for local inference is even further
            | along.
            | 
            | For example, look into
            | https://github.com/kvcache-ai/ktransformers, which achieves
            | >11 tokens/s on a relatively old two-socket Xeon server plus
            | a retail RTX 4090 GPU. Even more interesting is the prefill
            | speed of more than 250 tokens/s. This is very useful in use
            | cases like coding, where large prompts are common.
            | 
            | The above is achievable today. In the meantime the Intel
            | guys are working on something even more impressive. In
            | https://github.com/sgl-project/sglang/pull/5150 they claim
            | >15 tokens/s generation and >350 tokens/s prefill. They
            | don't share what exact hardware they run this on, but from
            | various bits and pieces across PRs I reverse-engineered that
            | they use 2x Xeon 6980P with MRDIMM 8800 RAM, without a GPU.
            | The total cost of such a setup will be around $10k once
            | cheap engineering samples hit eBay.
        
             | qeternity wrote:
              | It's neither impressive nor efficient once you consider
              | batch sizes > 1.
        
               | p12tic wrote:
               | All of this is for batch size 1.
        
         | pclmulqdq wrote:
          | CPUs are quietly becoming very well-balanced machines for
          | batch-size-1 inference. The latest Intel Xeons should manage
          | ~20 TPS.
        
           | Spooky23 wrote:
           | A base Mac Mini is ~20 :)
        
             | pclmulqdq wrote:
             | Oh yeah, I did that math not assuming any quantization. I
             | think if you can get a 3-4 bit quant working + int8 math,
             | ~80 might be achievable.
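              | 
              | The rough math (assuming ~37B active params per token and
              | ~1.7 TB/s aggregate memory bandwidth for a dual-socket
              | MRDIMM 8800 box; both numbers are ballpark):
              | 
              |     mem_bw = 1.7e12     # bytes/s, assumed aggregate
              |     active = 37e9       # active params read per token
              |     for bits in (16, 4):
              |         per_token = active * bits / 8   # bytes/token
              |         print(bits, round(mem_bw / per_token), "TPS")
              |     # 16-bit -> ~23 TPS, 4-bit -> ~92 TPS, as an upper
              |     # bound if decode is purely bandwidth-bound.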
        
         | platevoltage wrote:
         | Impressive. I need to look more into this. I'm doing my best to
         | limit my LLM usage to what I can run locally.
        
         | jbellis wrote:
         | impressive, but that's 1/5 to 1/10 of the throughput that you'd
         | get with a hosted provider, with 1/4 to 1/8 the supported
         | context
        
           | michelsedgh wrote:
            | Dude, he's running locally, and I think this setup is the
            | best bang for the buck if you want to run locally. We're not
            | comparing to data centers; you have to keep it in
            | perspective. Those are very impressive results for running
            | local. Thanks for the numbers, you saved me a ChatGPT search
            | :)
        
             | carstenhag wrote:
             | Title says: locally it's expensive
             | 
              | Other person says: I had to spend $4000 and it's still slow
        
         | refibrillator wrote:
         | > Unsloth Dynamic GGUF which, quality wise in real-world use
         | performs very close to the original
         | 
         | How close are we talking?
         | 
         | I'm not calling you a liar OP, but in general I wish people
         | perpetuating such broad claims would be more rigorous.
         | 
          | Unsloth does amazing work; however, as far as I'm aware, even
          | they do not publish head-to-head evals against the original
          | unquantized models.
         | 
         | I have sympathy here because very few people and companies can
         | afford to run the original models, let alone engineer rigorous
         | evals.
         | 
         | However I felt compelled to comment because my experience does
         | not match. For relatively simple usage the differences are hard
         | to notice, but they become much more apparent in high
         | complexity and long context tasks.
        
           | ryan_glass wrote:
           | You are right that I haven't been rigorous - it's easy to
           | benchmark tokens/second but quality of output is more
           | difficult to nail down. I couldn't find any decent
           | comparisons for Unsloth either. So I just tried a few of
           | their models out, looking for something that was 'good
           | enough' i.e. does all I need: coding, summarizing documents,
           | troubleshooting anything and everything. I would like to see
           | head to head comparisons too - maybe I will invest in more
           | RAM at some stage but so far I have no need for it. I ran
           | some comparisons between the smaller and larger versions of
           | the Unsloth models and interestingly (for me anyway) didn't
           | notice a huge amount of difference in quality between them.
           | But, the smaller models didn't run significantly faster so I
           | settled for the biggest model I could fit in RAM with a
           | decent context. For more complex coding I use Deepseek R1
           | (again the Unsloth) but since it's a reasoning model it isn't
           | real-time so no use as my daily driver.
        
         | 3eb7988a1663 wrote:
         | Do you have hard numbers on the idle/average/max power draw? I
            | assumed that server machines are built as if they are going
            | to be red-lined constantly, so less effort goes into
            | low-utilization optimizations.
        
           | ryan_glass wrote:
           | No hard numbers I'm afraid in that I don't monitor the power
           | draw. But the machine uses a standard ATX power supply: a
           | Corsair RM750e 750W PSU and the default TDP of the CPU is
           | 280W - I have my TDP set at 300W. It is basically built like
           | a desktop - ATX form factor, fans spin down at idle etc.
        
         | dotancohen wrote:
         | Just curious what your use cases are? What type of texts are
         | you producing?
         | 
         | Thank you.
        
       | fdfofeoijfeoie wrote:
       | Related: https://stackoverflow.com/q/79454372/320615
        
       | cycomanic wrote:
        | I was talking with a colleague the other day and we came to the
        | conclusion that, in our experience, if you're using LLMs as a
        | programming aid, models are really being optimised for the wrong
        | things.
        | 
        | At work I often compare locally run 4-30B models against various
        | GPTs (we can only use non-local models for a few things, because
        | of confidentiality issues). While e.g. GPT-4o gives better
        | results on average, the chance of it making parts of the
        | response up is high enough that one has to invest a significant
        | amount of effort to check and iterate on the results. So the
        | overall effort is not much lower than with the low-parameter
        | models.
        | 
        | The problem is that both are just too slow to really iterate
        | quickly, which makes things painful. I'd rather have a
        | lower-quality model (but with a large context) that gives me
        | near-instant responses than a higher-quality model that is slow.
        | I guess that doesn't generate the same headlines as an improved
        | score on some evaluation.
        
       ___________________________________________________________________
       (page generated 2025-06-01 23:00 UTC)