[HN Gopher] Cost of self hosting Llama-3 8B-Instruct
       ___________________________________________________________________
        
       Cost of self hosting Llama-3 8B-Instruct
        
       Author : veryrealsid
       Score  : 192 points
       Date   : 2024-06-14 15:30 UTC (7 hours ago)
        
 (HTM) web link (blog.lytix.co)
 (TXT) w3m dump (blog.lytix.co)
        
       | philipkglass wrote:
       | _Instead of using AWS another approach involves self hosting the
       | hardware as well. Even after factoring in energy, this does
       | dramatically lower the price._
       | 
       |  _Assuming we want to mirror our setup in AWS, we'd need 4x
       | NVidia Tesla T4s. You can buy them for about $700 on eBay.
       | 
       | Add in $1,000 to setup the rest of the rig and you have a final
       | price of around:
       | 
       | $2,800 + $1,000 = $3,800_
       | 
       | This whole exercise assumes that you're using the Llama 3 8b
       | model. At full fp16 precision that will fit in one 3090 or 4090
       | GPU (the int8 version will too, and run faster, with very little
       | degradation.) Especially if you're willing to buy GPU hardware
       | from eBay, that will cost significantly less.
       | 
       | I have my home workstation with a 4090 exposed as a vLLM service
       | to an AWS environment where I access it via reverse SSH tunnel.
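        | 
        | For anyone curious, the reverse tunnel itself is a one-liner. A
        | minimal sketch (hostnames are placeholders, and vLLM's default
        | port of 8000 is assumed):
        | 
        |   # run on the home workstation: expose local port 8000 on the
        |   # AWS box's loopback interface
        |   ssh -N -R 8000:localhost:8000 ubuntu@my-aws-host
        | 
        | Anything on the AWS side can then talk to http://localhost:8000
        | as if vLLM were running there; autossh or a systemd unit keeps
        | the tunnel alive across drops.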
        
         | causal wrote:
         | Came here to say this. No way you need to spend more than $1500
         | to run L3 8B at FP16. And you can get near-identical
         | performance at Q8 for even less.
         | 
         | I'm guessing actual break-even time is less than half that, so
         | maybe 2 years.
        
           | causal wrote:
           | Furthermore, the AWS estimates are also really poorly done.
           | Using EKS this way is really inefficient, and a better
           | comparison would be AWS Bedrock Haiku which averages $0.75/M
           | tokens: https://aws.amazon.com/bedrock/pricing/
           | 
           | This whole post makes OpenAI look like a better deal than it
           | actually is.
        
             | mrinterweb wrote:
             | I was getting that sense too. It would not be difficult to
             | build a desktop machine with a 4090 for around $2500. I run
             | Llama-3 8b on my 4090, and it runs well. Plus side is I can
             | play games with the machine too :)
        
         | shostack wrote:
         | How is inference latency for coding use cases on a local 3090
         | or 4090 compared to say, hitting the GPT-4o API?
        
           | whereismyacc wrote:
           | I assume the characteristics would be pretty different, since
           | your local hardware can keep the context loaded in memory,
           | unlike APIs which I'm guessing have to re-load it for each
           | query/generation?
        
             | christina97 wrote:
             | If you integrate with existing tooling, it won't do this
             | optimization. Unless of course you really go crazy with
             | your setup.
        
               | moffkalast wrote:
               | Setting one launch flag on llama.cpp server hardly
               | qualifies as going crazy with one's setup.
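                | 
                | The optimization being discussed is prompt/KV-cache
                | reuse. As a hedged illustration (field names are from
                | the llama.cpp server docs and may vary between
                | versions), a client can request it per call against the
                | default /completion endpoint:
                | 
                |   curl http://localhost:8080/completion -d '{
                |     "prompt": "<long shared system prompt> <question>",
                |     "n_predict": 128,
                |     "cache_prompt": true
                |   }'
                | 
                | With a common prefix across calls, only the new suffix
                | gets re-processed.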
        
         | kiratp wrote:
          | The Nvidia EULA prevents you from using consumer gaming GPUs in
          | datacenters, so 4xxx cards are a non-starter for any service
          | use cases.
         | 
         | EDIT: TOS -> EULA per comments below
        
           | nubinetwork wrote:
           | That never stopped the crypto farmers...
        
             | byteknight wrote:
             | They also weren't selling the usage of the cards.
        
           | oneshtein wrote:
           | Nvidia terms of what?
        
             | codetrotter wrote:
             | Parent commenter used the wrong word. It's the EULA that
             | prevents it.
             | 
             | Regardless, it is true that it is a problem.
             | 
             | https://www.reddit.com/r/MachineLearning/comments/ikrk4u/d_
             | c...
        
           | J_Shelby_J wrote:
           | What about on prem? Like, my small business needs an LLM. Can
           | I put a 3090 in a box in a closet?
           | 
           | What if I'm a business and I'm selling LLMs in a box for you
           | to put on a private network?
           | 
           | What constitutes a data center according to the ToS? Is it
           | enforceable if you never agree to the ToS (buying through
           | eBay?)
        
             | kiratp wrote:
             | By using the drivers you agree to their TOS. So yes, it
             | applies even on your private network.
        
               | swatcoder wrote:
               | The customer limitation described in the EULA is exactly
               | this:
               | 
               | > No Datacenter Deployment. The SOFTWARE is not licensed
               | for datacenter deployment, except that blockchain
               | processing in a datacenter is permitted.
               | 
               | - https://www.nvidia.com/content/DriverDownloads/licence.
               | php?l...
               | 
               | There's no further elaboration on what "datacenter" means
               | here, and it's a fair argument to say that a closet with
               | one consumer-GPU-enriched PC is not a "datacenter
               | deployment". The odds that Nvidia would pursue a claim
               | against an individual or small business who used it that
                | way are infinitesimal.
               | 
               | So both the ethical issue (it's a fair-if-debatable read
               | of the clause) and the practical legal issue (Nvidia
               | wouldn't bother to argue either way) seem to say one
                | needn't worry about it.
               | 
               | The clause is there to deter at-scale commercial service
               | providers from buying up the consumer card market.
        
             | light_hue_1 wrote:
             | Don't listen to this person. They have no idea what they're
             | talking about.
             | 
             | No one cares about this TOS provision. I know both startups
             | and large businesses that violate it as well as industry
             | datacenters and academic clusters. There are companies that
             | explicitly sell you hardware to violate it. Heck, Nvidia
             | will even give you a discount when you buy the hardware to
             | violate it in large enough volume!
             | 
             | You do you.
        
               | wongarsu wrote:
               | In a previous AI wave hosters like OVH and Hetzner
               | started offering servers with GTX 1080 at prices other
               | hosters with datacenter-grade GPUs couldn't possibly
               | compete with - and VRAM wasn't as big of a deal back
               | then. That's who this clause targets.
               | 
                | If you don't rent out servers or VMs, Nvidia doesn't
                | care. They aren't Oracle.
        
           | giancarlostoro wrote:
           | It's not in a data center, it's in his home.
        
           | badgersnake wrote:
           | How would they even know?
        
           | jtriangle wrote:
           | There are no nvidia police, they literally cannot stop you
           | from doing this.
        
         | choppaface wrote:
         | Yeah but this article is terrible. First it talks about naively
         | copy-pasting code to get "a seeming 10x speed-up" and then
         | "This ended up being incorrect way of calculating the tokens
         | used."
         | 
         | I would not bank on anything in this article. It might as well
         | have been written by a tiny Llama model.
        
         | czhu12 wrote:
          | I do the same thing with Cloudflare Tunnels, managing the
          | tunnel process and the llama.cpp server with systemd on my
          | home internet connection.
         | 
         | Have a 13B running on a 3070 with 16 gpu layers and the rest
         | running off CPU.
         | 
         | Performs okay, but way cheaper than renting a GPU on the cloud.
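          | 
          | For anyone wanting to replicate that shape of setup, a rough
          | sketch (tunnel name and hostname are placeholders; exact
          | commands depend on your cloudflared version):
          | 
          |   # one-time Cloudflare Tunnel setup
          |   cloudflared tunnel login
          |   cloudflared tunnel create llama
          |   cloudflared tunnel route dns llama llama.example.com
          | 
          |   # install cloudflared as a persistent systemd service
          |   sudo cloudflared service install
          | 
          | A config.yml ingress rule points the hostname at the local
          | llama.cpp server port, and the llama.cpp server gets its own
          | small systemd unit so both pieces come back up after a reboot.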
        
         | logtrees wrote:
         | Whoa, so you have code running in AWS making use of your local
         | hardware via what is called a reverse SSH tunnel? I will have
         | to look into how that works, that's pretty powerful if so. I
         | have a mac mini that I use for builds and deploys via FTP/SFTP
         | and was going to look into setting up "messaging" via that
         | pipeline to access local hardware compute through file messages
         | lol, but reverse SSH tunnel sounds like it'll be way better for
         | directly calling executables rather than needing to parse
         | messages from files first.
        
           | brrrrrm wrote:
           | I use my mac mini exactly as described by the parent post but
            | using ollama as the server. Super easy setup, and obv ChatGPT
            | can guide you through it.
        
             | logtrees wrote:
             | Unfortunately my mac mini isn't beefy enough to run ollama,
             | it's the base model m1 from a couple years ago lol. But
             | it's very powerful for builds, deploys, and some
             | computation via scripts. Now I'm curious to check out how
             | much memory the newest ones support for potentially using
             | ollama on it haha. Thanks!
        
               | brrrrrm wrote:
                | Mine is also an M1. Just use llama3; it's 8B, quantized
                | by default.
        
               | logtrees wrote:
               | I will try it out, curious to see how it will work with
               | 8gb of memory haha. Thanks for the heads up!
        
               | apnew wrote:
               | Do you happen to have any handy guides/docs/references
               | for absolute beginners to follow?
        
               | paulmd wrote:
               | Ollama is not as powerful as llama.cpp or raw pytorch,
               | but it is almost zero effort to get started.
               | 
                | brew install ollama; ollama serve; ollama pull
                | dolphin-llama3:8b-v2.9-q5_K_M; ollama run
                | dolphin-llama3:8b-v2.9-q5_K_M
               | 
               | https://ollama.com/library/dolphin-llama3:8b-v2.9-q5_K_M
               | 
               | (It may need to be Q4 or Q3 instead of Q5 depending on
               | how the RAM shakes out. But the Q5_K_M quantization
               | (k-quantization is the term) is generally the best
               | balance of size vs performance vs intelligence if you can
               | run it, followed by Q4_K_M. Running Q6, Q8, or fp16 is of
               | course even better but you're nowhere near fitting that
               | on 8gb.)
               | 
               | https://old.reddit.com/r/LocalLLaMA/comments/1ba55rj/over
               | vie...
               | 
                | Dolphin-llama3 is generally more compliant and I'd
                | recommend that over just the base model. It's been fine-
                | tuned to filter out the dumb "sorry I can't do that"
                | refusals, and it turns out this also increases the
                | quality of the results (refusal training limits the
                | space of outputs you're generating, which also limits
                | their quality).
               | 
               | https://erichartford.com/uncensored-models
               | 
               | https://arxiv.org/abs/2308.13449
               | 
                | Most of the time you will want to look for an "instruct"
                | model; if it doesn't have the instruct suffix it'll
                | normally be a "fill in the blank" model that finishes
                | what it thinks is the pattern in the input, rather than
                | generating a textual answer to a question. But ollama
                | typically pulls the instruct models into their repos.
               | 
               | (sometimes you will see this even with instruct models,
               | especially if they're misconfigured. When llama3 non-
               | dolphin first came out I played with it and I'd get
               | answers that looked like stackoverflow format or quora
               | format responses with ""scores"" etc, either as the full
               | output or mixed in. Presumably a misconfigured model, or
               | they pulled in a non-instruct model, or something.)
               | 
                | Dolphin-mixtral:8x7b-v2.7 is where things get really
                | interesting imo. I have 64gb and 32gb machines and so far
                | Q6 and Q4_K_M are the best options for those machines.
                | dolphin-llama3 is reasonable but dolphin-mixtral gives
                | richer, better responses.
                | 
                | I'm told there's better stuff available now, but not sure
                | what a good choice would be for 64gb and 32gb if not
                | mixtral.
               | 
               | Also, just keep an eye on r/LocalLLaMA in general, that's
               | where all the enthusiasts hang out.
        
           | verdverm wrote:
           | using Tailscale can make the networking setup much easier,
           | really like their service for things like this (or curling
           | another dev's local running server)
        
           | sneak wrote:
           | Look into Nebula (or Tailscale if you trust third parties). I
           | have all my workstations and servers on a mesh network that
           | appears as a single /24 that is end to end encrypted,
           | mutually authenticated and works through/behind NAT. I can
           | spawn a vhost on any server that reverse proxies an API to
           | any port on any machine.
           | 
           | It's been an absolute gamechanger.
        
             | logtrees wrote:
             | Whooooaaa that is mind-blowing. Thanks for sharing. <3
        
             | elorant wrote:
             | Is there any resource that goes into more detail about how
              | to set up all this?
        
               | sneak wrote:
               | https://github.com/slackhq/nebula
               | 
                | The docs are good. When creating the initial CA, make
                | absolutely sure you set the CA expiration to 10-30 years;
                | the default is 1 year, which means your whole setup
                | explodes in a year without warning.
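                | 
                | Concretely that means passing a long -duration when
                | creating the CA (flag names per nebula-cert's help; ten
                | years shown here):
                | 
                |   nebula-cert ca -name "my-mesh" -duration 87600h
                | 
                | The one-year default is what bites people.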
        
             | aborsy wrote:
             | Why do you have to trust a third party?
             | 
              | It's end to end encrypted, and with tailnet lock enabled,
              | nodes cannot be added without the user's permission.
        
             | 1oooqooq wrote:
             | why either of these over plain wireguard if you're not
             | provisioning accounts?
        
               | sneak wrote:
                | WireGuard doesn't do NAT punching and is not mesh, it's
                | p2p only.
                | 
                | Totally different use case.
        
           | favflam wrote:
            | You can also check if you have IPv6. I have tried both, but
            | prefer connecting directly to home.
        
         | hehdhdjehehegwv wrote:
         | I dropped $5k on an A6000 and I can run llama3:70b day and
         | night for the price of my electricity bill.
         | 
         | I've gone through hundreds of millions, maybe billions, of
         | tokens in the past year.
         | 
         | This article is just "cloud is expensive" 101. Nothing new.
        
           | brcmthrowaway wrote:
           | Hows your ROI?
        
             | hehdhdjehehegwv wrote:
             | Absolutely phenomenal.
        
           | logicallee wrote:
           | Super cool, thanks for sharing. Do you mind sharing what you
           | used the hundreds of millions (or billions) of tokens on?
        
           | hereonout2 wrote:
           | I've worked professionally over the last 12 months hosting
            | quite a few foundation models and fine-tuned LLMs on our own
            | hardware, AWS + Azure VMs, and also a variety of newer
           | "inference serving" type services that are popping up
           | everywhere.
           | 
           | I don't do any work with the output, I'm just the MLOps guy
           | (ahem, DevOps).
           | 
           | You mention expense but on a purely financial basis I find
           | any of these hosted solutions really hard to justify against
           | GPT 3.5 turbo prices, including building your own rig. $5k +
           | electricity is loads of 3.5 Turbo tokens.
           | 
           | Of course none of the data scientists or researchers I work
           | with want to use that though - it's not their job to host
           | these things or worry about the costs.
        
           | elorant wrote:
           | Is this at 4-bit quantization? And how many tokens per second
           | is the output?
        
           | EvgeniyZh wrote:
           | 1B of tokens for Gemini Flash (which is on par with
           | llama3-70b in my experience or even better sometimes) with
           | 2:1 input-output would cost ~600 bucks (ignoring the fact
           | they offer 1M tokens a day for free now). Ignoring
           | electricity you'd break even in >8 years. You can find
           | llama3-70b for ~same prices if you're interested in the
           | specific model.
        
         | cootsnuck wrote:
         | Yea, for any hobbyist, indie developer, etc. I think it'd be
         | ridiculous to not first try running one of these smaller (but
         | decently powerful) open source models on your own hardware at
         | home.
         | 
         | Ollama makes it dead simple just to try it out. I was
         | pleasantly surprised by the tokens/sec I could get with Llama 3
         | 8B on a 2021 M1 MBP. Now need to try on my gaming PC I never
         | use. Would be super cool to just have a LLM server on my local
         | network for me and the fam. Exciting times.
        
         | speakspokespok wrote:
          | Why did this only occur to me recently? You can self-host a k8s
          | cluster and expose the services using a $5 DigitalOcean
         | droplet. The droplet and k8s services are point-to-point
         | connected using tailscale. Performance is perfectly fine, keeps
         | your skillset sharp, and you're self-hosting!
        
           | Helithumper wrote:
           | You can also just directly connect to containers using
           | Tailscale if it's just for internal use. That is, having an
           | internally addressable `https://container_name` on your
            | tailnet per-container if you want. This way I can set up
           | Immich for example and it's just on my tailnet at
           | `https://immich` without the need for a reverse proxy, etc...
           | 
           | https://tailscale.com/blog/docker-tailscale-guide
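            | 
            | Roughly, the sidecar pattern from that guide looks like this
            | (env vars are from the tailscale/tailscale image; image
            | names and the auth key are placeholders, so check the guide
            | for the exact flags):
            | 
            |   # tailscale sidecar that owns the network namespace
            |   docker run -d --name=ts-immich --hostname=immich \
            |     -e TS_AUTHKEY=tskey-auth-xxxxx \
            |     -e TS_STATE_DIR=/var/lib/tailscale \
            |     -v ts-immich-state:/var/lib/tailscale \
            |     tailscale/tailscale
            | 
            |   # app container joins the sidecar's network namespace
            |   docker run -d --network=container:ts-immich \
            |     ghcr.io/immich-app/immich-server
            | 
            | The app is then reachable over the tailnet at the sidecar's
            | hostname, no reverse proxy or public port required.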
        
             | SparkyMcUnicorn wrote:
             | And you can use Tailscale Funnel to serve it publicly. No
             | need to pay for a cloud instance.
             | 
             | https://tailscale.com/kb/1223/funnel
        
       | liquidise wrote:
        | Great mix of napkin math and proper analysis, but what strikes me
        | most is how cheap LLM access is. For something relatively
        | bleeding edge, us splitting hairs over < $20/M tokens is
        | remarkable in itself, and something tech people should be
        | thrilled about.
        
         | refulgentis wrote:
          | Smacks of the "starving kids in Africa" fallacy; you could make
          | the same argument that tech people should be thrilled about the
          | current thing being available at $X for X =
          | $2/$20/$200/$2000...
        
       | theogravity wrote:
       | The energy costs in the bay area are double the reported 24c
       | cost, so energy alone would be around $100-ish a month instead of
       | $50-ish.
        
         | pkaye wrote:
         | Unless you are in Santa Clara with Silicon Valley Power rates.
         | 
         | https://www.siliconvalleypower.com/residents/rates-and-fees
        
         | veryrealsid wrote:
          | Yeah, agreed. Some of the areas we have access to were 16c (PA)
          | and up to 24c (NYC); we doubled that cost in the analysis
          | because of things like this.
        
         | angoragoats wrote:
         | Except that the article assumes that the GPUs would be using
         | their max TDP all the time, which is incorrect. GPUs will
         | throttle down to 5-20w (depending on the specific GPU). So your
         | actual power consumption is going to be much, much lower,
         | unless you're literally using your LLM 24/7.
        
       | causal wrote:
       | No way you need $3,800 to run an 8B model. 3090 and a basic rig
       | is enough.
       | 
       | That being said, the difference between OpenAI and AWS cost ($1
       | vs $17) is huge. Is OpenAI just operating at a massive loss?
       | 
       | Edit: Turns out AWS is actually cheaper if you don't use the
       | terrible setup in this article, see comments below.
        
         | throwup238 wrote:
         | AWS's pricing is just ridiculous. Their 1-year reserve pricing
         | for an 8x H100 or A100 instance (p4/p5) costs just as much as
         | buying the machine outright with tens of thousands left over
         | for the NVIDIA enterprise license and someone to manage them
         | (per instance!). Their on demand pricing is even more insane -
         | they're charging $3.x/hr for six year old cards.
        
           | readams wrote:
           | What about the cost of the power and cooling to run the
           | machine (a lot!), and the staff to keep it running?
        
             | throwup238 wrote:
             | That's why I said "and someone to manage them". The
             | difference is in the tens of thousands of dollars _per
             | instance_. The savings from even a dozen instances is
             | enough to pay for someone to manage them full time, and
              | that's just for the first year. Year 2 and 3 you're saving
             | six figures per instance so you'd be able to afford one
             | person per machine to hand massage them like some fancy
             | kobe beef.
             | 
             | A100 TDP is 400W so assuming 4kW for the whole machine,
             | that's a little more than $5k/year at $0.15/kWh. Again, the
             | difference is in the tens of thousands _per instance_. Even
             | at 50% utilization over three years, if you need more than
              | a dozen machines it's much cheaper to buy them outright,
             | especially on credit.
        
         | refulgentis wrote:
          | I mean, no. I came to scan the comments quickly after reading,
          | because there's a lot of bad info you can walk away with from
          | the post; it's sort of starting from scratch on hosting LLMs.
          | 
          | If you keep reading past there, they get it down significantly.
          | The 8 tkn/s number AWS was evaluated on is really funny; that's
          | about what you'd get on last year's iPhone, and it's not
          | because Apple is special, it's because there's barely any
          | reasonable optimization being done here. No batching, float32
          | weights (8 bit is guaranteed indistinguishable from 32 bit, 5
          | bit tests as definitely indistinguishable in blind tests, 4 bit
          | arguably is indistinguishable).
        
           | causal wrote:
           | You're right. In fact, using EKS at all is silly when AWS
           | offers their Bedrock service with Claude Haiku (rated #19 on
           | Chat Arena vs. ChatGPT3.5-Turbo at #37) for a much lower cost
           | of $0.75/M tokens (averaging input and output like OP
           | does)[0].
           | 
           | So in reality AWS is cheaper for a much better model if you
           | don't go with a wildly suboptimal setup.
           | 
           | [0] https://aws.amazon.com/bedrock/pricing/
        
         | throwaway240403 wrote:
         | I thought it was generally known they were operating at a loss?
         | 
          | Even with the subs and API charges, they still let people use
          | ChatGPT for free with no monetization options. Sure, they are
          | collecting the data for training, but that's hard to quantify
          | the value of.
        
       | jezzarax wrote:
        | llama.cpp + llama-3-8b in Q8 runs great on a single T4 machine.
        | I cannot remember the TPS I got there, but it was well above the
        | 6 mentioned in the article.
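        | 
        | For reference, a Q8_0 GGUF of the 8B model is roughly 8-9 GB, so
        | it fits in the T4's 16 GB with room for context. A hedged sketch
        | of the invocation (the binary is llama-server in recent builds,
        | ./server in older ones; the GGUF filename is a placeholder):
        | 
        |   ./llama-server -m Meta-Llama-3-8B-Instruct.Q8_0.gguf \
        |     -ngl 99 -c 8192 --host 0.0.0.0 --port 8080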
        
         | veryrealsid wrote:
         | Interesting, I got very different results depending on how I
         | ran the model, will definitely give this a try!
         | 
         | edit: Actually could you share how long it took to make a
         | query? One of our issues is we need it to respond in a fast
         | time frame
        
           | jezzarax wrote:
            | I checked some logs from my past experiments: prompt
            | processing ran at about 400 tps over a ~3k token query, so
            | about 7 seconds to process it, and then the generation speed
            | was about 28 tokens per second.
        
       | throwup238 wrote:
       | The T4 is a six year old card. A much better comparison would be
       | a 3090, 4090, A10, A100, etc.
        
       | michaelmior wrote:
       | There's also the option of platforms such as BentoML (I have no
       | affiliation) that offer usage-based pricing so you can at least
       | take the 100% utilization assumption off the table. I'm not sure
       | how the price compares to EKS.
       | 
       | https://www.bentoml.com/
        
       | barbegal wrote:
       | There's some dodgy maths
       | 
       | >( 100 / 157,075,200 ) * 1,000,000 = $0.000000636637738
       | 
       | Should be $0.64 so still expensive
        
         | jasonjmcghee wrote:
          | Being 6 orders of magnitude off in your cost calculation isn't
          | great.
          | 
          | Groq costs about that for Llama 3 70B (which is a monumentally
          | better model) and 1/10th of that for Llama 3 8B.
        
           | pants2 wrote:
           | Groq doesn't currently have a paid API that one can sign up
           | for.
        
             | jasonjmcghee wrote:
             | Yup. True. Should say "will" - currently free but heavily
             | rate-limited. Together AI looks to be about $0.30 / 1M
             | tokens, as another price comparison. Which you can pay for.
        
       | gradus_ad wrote:
       | I wonder how long NVIDIA can justify its current market cap once
       | people realize just how cheap it is to run inference on these
        | models, given that LLM performance is plateauing, LLMs as a whole
       | are becoming commoditized, and compute demand for training will
       | drop off a cliff sooner than people expect.
        
         | nextworddev wrote:
          | It's actually about training, not inference. You can't do
          | training on commodity GPUs, but yeah, once someone figures that
          | out, Nvidia could crash.
        
           | gradus_ad wrote:
            | I know; my point is that when training demand decreases,
            | people will realize that inference does not make up the
            | difference.
        
             | nextworddev wrote:
              | Yeah, the big question I'm struggling with is exactly when
              | training demand will fall, if at all.
        
               | sroussey wrote:
               | Every research lab is focused on new architectures that
               | would reduce training costs.
        
               | nextworddev wrote:
               | Yeah we need essentially hadoop for llm training
        
           | amluto wrote:
           | Nvidia doesn't obviously have a strong inference play right
           | now for a widely-deployed small model. For a model that
           | really needs a 4090, maybe. But for a model that can run on a
           | Coral chip or an M1/M2/M3 or whatever Intel or AMD's latest
           | little AI engines can do? This market has plenty of players,
           | and Nvidia doesn't seem to be anywhere near the lead except
           | insofar as it's a little bit easier to run the software on
           | CUDA.
        
         | smokel wrote:
         | As someone else points out, training is slightly more involved,
         | but I also find that these smaller models are next to worthless
         | compared to the larger ones.
         | 
         | There are probably some situations where it suffices to use a
         | small model, but for most purposes, I'd prefer to use the state
         | of the art, and I'm eager for that state to progress a little
         | more.
        
         | dwaltrip wrote:
         | > LLM performance is plateauing
         | 
         | It's a wee bit early to call this. Let's see what the top labs
         | release in the next year or two, yeah?
         | 
         | GPT-4 was released only 15 months ago, which was about 3 years
         | after GPT-3 was released.
         | 
         | These things don't happen overnight, and many multi-year
         | efforts are currently in the works, especially starting last
         | year.
        
         | epolanski wrote:
          | I partially believe that the real race for many tech players is
          | actually AGI, and later ASI, and until that problem is solved
          | the hardware arms race will keep being part of it.
          | 
          | Not only is big tech part of it, but billion-dollar startups
          | are popping up everywhere from China to the US and the Middle
          | East.
        
           | chrisdbanks wrote:
           | Unless we hit another AI winter. We might get to the point
            | where the hardware just can't give better returns and we have
            | to
           | wait another 20 years for the next leap forward. We're still
           | orders of magnitude away from the human brain.
        
         | riku_iki wrote:
         | > I wonder how long NVIDIA can justify its current market cap
         | once people realize just how cheap it is to run inference on
         | these models given that LLM performance is plateauing
         | 
          | The next wave driving demand could be actual new products built
          | on LLMs. There are very few use cases currently well developed
          | besides chatbots, but the potential is very large.
        
       | throwaway2016a wrote:
       | Llama-3 is one of the models provided by AWS Bedrock which offers
       | pay as you go pricing. I'm curious how it would break down on
       | that.
       | 
        | Llama 3 8B on Bedrock is $0.40 per 1M input tokens and $0.60 per 1M
       | output tokens which is a lot cheaper than OpenAI models.
       | 
       | Edit: to add to that, as technical people we tend to discount the
        | value of our own time. Bedrock and the OpenAI API are both easy
        | to integrate with and get started with. How long did this server take
       | to build? How much time does it take to maintain and make sure
       | all the security patches are applied each month? How often does
       | it crash and how much time will be needed to recover it? Do you
       | keep spare parts on hand / how much is the cost of downtime if
       | you have to wait to get a replacement part in the mail? That's
       | got to be part of the break-even equation.
        
         | croddin wrote:
          | Groq also has pay-as-you-go pricing for llama3 8B at only
          | $0.05/$0.08 per 1M tokens (input/output), and it is very fast.
        
           | sergiotapia wrote:
           | Groq is actually allowing you to pay now and get real
           | service?
        
             | coder543 wrote:
             | The option to pay is still listed as coming soon, but I
             | also see pricing information in the settings page, so maybe
             | it actually is coming somewhat sooner. I'm seeing $0.05/1M
             | input and $0.10/1M output for llama3 8B, which is not
             | exactly identical to what the previous person quoted.
             | 
             | Either way, I wish Groq _would_ offer a real service to
             | people willing to pay.
        
               | croddin wrote:
               | I found the .05/.08 here: https://wow.groq.com/
        
             | refulgentis wrote:
             | tl;dr: no-ish, it's getting better but still not there.
             | 
              | I don't really get it. The only thing I can surmise is that
              | it'd be such a no-brainer in various cases that if they
              | tried supporting it as a service, they'd have to cut users.
              | I've seen multiple big media company employees begging for
              | some sort of response on their Discord.
        
           | localfirst wrote:
           | didn't know they finally turned on pricing plans
        
         | VagabundoP wrote:
          | Just to bounce off this a little: if you are looking to fine-
          | tune using an on-demand service, it seems Amazon SageMaker can
          | do it at decent prices:
         | 
         | https://aws.amazon.com/sagemaker/pricing/
         | 
          | I'd love to hear someone's experience using this, as I want to
          | make an RPG rules bot tied to a specific ruleset as a project,
          | but I fear AWS might bankrupt me!
        
           | zsyllepsis wrote:
           | In my experience SageMaker was relatively straightforward for
           | fine-tuning models that could fit on a single instance, but
           | distributed training still requires a good bit of detailed
           | understanding of how things work under the covers. SageMaker
           | Jumpstart includes some pretty easy out-of-the-box
           | configurations for fine-tuning models that are a good
           | starting point. They will incorporate some basic quantization
           | and other cost-savings techniques to help reduce the total
           | compute time.
           | 
           | To help control costs, you can choose pretty conservative
           | settings in terms of how long you want to let the model train
           | for. Once that iteration is done and you have a model
           | artifact saved, you can always pick back up and perform more
           | rounds of training using the previous checkpoint as a
           | starting point.
        
         | veryrealsid wrote:
         | > How long did this server take to build?
         | 
         | About 3 days [from 0 and iterating multiple times to the final
         | solution]
         | 
         | > How much time does it take to maintain and make sure all the
         | security patches are applied each month?
         | 
         | A lot
         | 
         | > How often does it crash and how much time will be needed to
         | recover it? Do you keep spare parts on hand / how much is the
         | cost of downtime if you have to wait to get a replacement part
         | in the mail?
         | 
         | All really good points, the exercise to self host is really
         | just to see what is possible but completely agree that self
         | hosting makes little to no sense unless you have a business
         | case that can justify it.
         | 
          | Not to mention that if you sign customers with SLAs and then
          | end up having downtime, that would put even more pressure on
          | your self-hosted hardware.
        
         | johnklos wrote:
         | > How long did this server take to build? How much time does it
         | take to maintain and make sure all the security patches are
         | applied each month? How often does it crash and how much time
         | will be needed to recover it? Do you keep spare parts on hand /
         | how much is the cost of downtime if you have to wait to get a
         | replacement part in the mail? That's got to be part of the
         | break-even equation.
         | 
         | All of these are the kinds of things that people say to non-
         | technical people to try to sell cloud. It's all fluff.
         | 
          | Do you _really_ think that cloud computing doesn't have
         | security issues, or crashes, or data loss, or that it doesn't
         | involve lots of administration? Thinking that we don't know any
         | better is both disingenuous and a bit insulting.
        
           | websap wrote:
            | I've managed fleets on cloud providers with over 100k
            | instances; even with all the excellent features exposed
            | through APIs, managing instances can quickly get tricky.
            | 
            | Tbh, your comment is kind of insulting and belittles how far
            | we've come in infrastructure management.
            | 
            | The cloud is probably more secure than a set of janky servers
            | that you have running in your basement. You can totally
            | automate away 0-days and CVEs and get access to better
            | security primitives.
        
             | johnklos wrote:
             | If my comment is insulting, I apologize. That was not my
             | intention. My intention was to say that writing sales speak
             | in a technical discussion is insulting to those of us who
             | know better.
             | 
             | However, you've now gone out of your way to try to be
             | insulting. You know nothing about me, yet you want to
             | suggest that the cloud is more secure than my servers, and
             | that my servers are "janky"?
             | 
             | Please try a little harder to engage in reasonable
             | discourse.
        
             | yjftsjthsd-h wrote:
             | > The cloud is probably more secure than a set of janky
             | servers that you have running in your basement.
             | 
             | Apples/Oranges. Your janky cloud[0] is less secure than the
             | servers in my basement, because I'm a mostly competent
             | sysadmin. Cloud lets you trade _some_ operational concerns
             | for higher costs, but not all of them.
             | 
             | [0] If you can assume servers run by somebody who doesn't
             | know how to do it properly, obviously I can assume the same
             | about cloud configuration. Have fun with your leaked API
             | keys.
        
           | throwaway2016a wrote:
           | I've managed both data centers and cloud and IMHO, no, it is
           | not fluff. To take it in order:
           | 
           | > doesn't have security issues
           | 
           | It sure does, but the matrix of responsibility is very
           | different when it is a hosted service. Note: I am making
            | these comments about Bedrock, which is serverless, not in
            | relation to EC2.
           | 
           | > It crashes
           | 
           | Absolutely, but the recovery profile is not even close to the
           | same. Unless you have a full time person with physical access
           | to your server who can go press buttons.
           | 
           | > data loss
           | 
           | I'm going to shift this one a tiny bit. What about hardware
           | loss? You need backups regardless. On the cloud when a HDD
           | dies you provision a new one. On premise you need to have the
           | replacement there and ready to swap out (unless you want to
           | wait for shipping). Same with all the other components. So
           | you basically need to buy two of everything. If you have a
           | fleet of servers that's not too bad since presumably they
           | aren't going to all fail on the same component at the same
           | time. But for a single server it is literally double the
           | cost.
           | 
           | > doesn't involve lots of administration
           | 
            | Again, this is in relation to Bedrock, which is a managed
            | serverless environment. So there is literally no
           | administration aside from provisioning and securing access to
           | the resource. You'd have a point if this was running on EC2
           | or EKS but that's not what my post was about.
           | 
           | > Thinking that we don't know any better is both disingenuous
           | and a bit insulting.
           | 
            | I'm not saying cloud is perfect in any way; like all things
            | it requires tradeoffs. But quite frankly, I find you
            | dismissing my 25 years of experience, 1/3 of which has been
            | working in real data centers (including a top-50 internet
            | company at the time), as "fluff" to be "disingenuous and a
            | bit insulting".
        
             | amluto wrote:
             | > Unless you have a full time person with physical access
             | to your server who can go press buttons.
             | 
             | Every colo facility I've used offers "remote hands". If you
             | need a button pressed or a disk swapped, they will do it,
             | with a fee structure and response time that varies
             | depending on one's arrangement with the operator. But it's
             | generally both inexpensive and fast.
             | 
             | > What about hardware loss? You need backups regardless. On
             | the cloud when a HDD dies you provision a new one. On
             | premise you need to have the replacement there and ready to
             | swap out (unless you want to wait for shipping).
             | 
             | Two of everything may still be cheaper than expensive cloud
             | services. But there's an obvious middle ground: a service
             | contract that guarantees you spare parts and a technician
             | with a designated amount of notice. This service is widely
             | available and reasonably priced. (Don't believe the listed
             | total prices on the web sites of big name server vendors --
             | they negotiate substantial discounts, even in small
             | quantities.)
        
               | throwaway2016a wrote:
               | > But it's generally both inexpensive and fast.
               | 
               | I guess inexpensive is relative. I've been on cloud for a
               | while so I'm not sure what the going rates are for
               | "remote hands" and most of my experience is with on-
               | premise vs co-lo.
               | 
               | > Two of everything may still be cheaper than expensive
               | cloud services.
               | 
                | That is true. Everything has tradeoffs. Though in the OP's
                | case I think the math is relatively clear. With OpenAI's
                | pricing he calculated the break-even at 5 years just for
               | the hardware and electricity. Assuming that calculation
               | is right, two of everything would up that to 7+ years, at
               | which point... a lot can happen in 7 years.
        
               | amluto wrote:
               | > I guess inexpensive is relative. I've been on cloud for
               | a while so I'm not sure what the going rates are for
               | "remote hands" and most of my experience is with on-
               | premise vs co-lo.
               | 
               | At a low end facility, I've usually paid between $0 and
               | $50 per remote hands incident. The staff was friendly and
               | competent, and I had no complaints. The price list goes a
               | bit higher, but I haven't needed those services at that
               | facility.
        
           | yolovoe wrote:
           | You could have gotten rid of the middle paragraph. It's not
           | fluff. These are valid technical points. Issues most
           | companies would rather (reasonably) pay to not have to deal
           | with.
           | 
           | And do you really think you can offer better security and
           | uptime than AWS? Not impossible but very expensive if you're
           | managing everything from your own hardware. You clearly
           | vastly underestimate all that AWS is taking care of.
        
       | AaronFriel wrote:
       | These costs don't line up with my own experiments using vLLM on
       | EKS for hosting small to medium sized models. For small (under
       | 10B parameters) models on g5 instances, with prefix caching and
       | an agent style workload with only 1 or a small number of turns
       | per request, I saw on the order of tens of thousands of
       | tokens/second of prefill (due to my common system prompts) and
       | around 900 tokens/second of output.
       | 
       | I think this worked out to around $1/million tokens of output and
       | orders of magnitude less for input tokens, and before reserved
       | instances or other providers were considered.
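        | 
        | For anyone wanting to reproduce that kind of setup, a minimal
        | sketch of serving the 8B model with vLLM's OpenAI-compatible
        | server and prefix caching enabled (flag names are from the vLLM
        | docs of that era and may have changed since):
        | 
        |   pip install vllm
        |   python -m vllm.entrypoints.openai.api_server \
        |     --model meta-llama/Meta-Llama-3-8B-Instruct \
        |     --enable-prefix-caching \
        |     --max-model-len 8192
        | 
        | Prefix caching is what makes a long, repeated system prompt
        | nearly free on the prefill side, and continuous batching across
        | concurrent requests is what drives the aggregate tokens/second.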
        
         | veryrealsid wrote:
         | Interesting, I think how the model runs makes a big difference
         | and I plan to re-run this experiment with different models and
         | different ways of running the model.
        
       | winddude wrote:
        | Does AWS not have lower vCPU and memory instances with multiple
        | T4s? Because with 192GB of memory and 24 cores, you're paying
       | for a ton of resources you won't be using if you're only running
       | inference.
        
       | kiratp wrote:
       | 3 year commit pricing with Jetstream + Maxtext on TPU v5e is
       | $0.25 per million tokens.
       | 
        | On-demand pricing puts it at about $0.45 per million tokens.
       | 
       | Source: We use TPUs at scale at https://osmos.io
       | 
       | Google Next 2024 session going into detail:
       | https://www.youtube.com/watch?v=5QsM1K9ahtw
       | 
       | https://github.com/google/JetStream
       | 
       | https://github.com/google/maxtext
        
         | qihqi wrote:
         | For pytorch users: checkout the sister project:
         | https://github.com/google/jetstream-pytorch/blob/main/benchm...
        
       | yousif_123123 wrote:
       | deepinfra.com hosts Llama 3 8b for 8 cents per 1m tokens. I'm not
       | sure it's the cheapest but it's pretty cheap. There may be even
       | cheaper options.
       | 
        | (Haven't used it in production; thinking of using it for side
        | projects.)
        
       | xmonkee wrote:
       | Does anyone know the impact of the prompt size in terms of
       | throughput? If I'm only generating 10 tokens, does it matter if
       | my initial prompt is 10 tokens or 8000 tokens? How much does it
       | matter?
        
       | vinni2 wrote:
        | GGML Q8 models on ollama can run on much cheaper hardware without
       | losing much performance.
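        | 
        | For example (assuming the q8_0 instruct tag in the ollama
        | library still goes by this name):
        | 
        |   ollama run llama3:8b-instruct-q8_0
        | 
        | That's roughly a 9 GB download and fits comfortably on a single
        | GPU with 12 GB or more of VRAM.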
        
       | ilaksh wrote:
       | Kind of a ridiculous approach, especially for this model. Use
       | together.ai, fireworks.ai, RunPod serverless, any serverless. Or
        | use ollama with the default quantization; it will work on many
        | home computers, including my gaming laptop, which is about 5
        | years old.
        
       | angoragoats wrote:
       | Agreed with the sentiments here that this article gets a lot of
       | the facts wrong, and I'll add one: the cost for electricity when
       | self-hosting is dramatically lower than the article says. The
       | math assumes that each of the Tesla T4s will be using their full
       | TDP (70W each) 24 hours a day, 7 days a week. In reality, GPUs
       | throttle down to a low power state when not in use. So unless
       | you're conversing with your LLM literally 24 hours a day, it will
       | be using dramatically less power. Even when actively doing
       | inference, my GPU doesn't quite max out its power usage.
       | 
       | Your self-hosted LLM box is going to use maybe 20-30% of the
       | power this article suggests it will.
       | 
       | Source: I run LLMs at home on a machine I built myself.
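        | 
        | You can check this on your own box; nvidia-smi reports the live
        | draw next to the board's power limit:
        | 
        |   nvidia-smi --query-gpu=name,power.draw,power.limit --format=csv
        | 
        | An idle card typically sits at a small fraction of its TDP.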
        
       | baobabKoodaa wrote:
       | If we care about cost efficiency when running LLMs, the most
       | important things are:
       | 
       | 1. Don't use AWS, because it's one of the most expensive cloud
       | providers
       | 
       | 2. Use quantized models, because they offer the best output
       | quality per money spent, regardless of the budget
       | 
       | This article, on the other hand, focuses exclusively on running
       | an unquantized model on AWS...
        
       | johnklos wrote:
       | Self hosting means hosting it yourself, not running it on Amazon.
       | I think the distinction the author intends to make is between
       | running something that can't be hosted elsewhere, like ChatGPT,
       | versus running Llama-3 yourself.
       | 
       | Overlooking that, the rest of the article feels a bit strange.
       | Would we really have a use case where we can make use of those
       | 157 million tokens a month? Would we really round $50 of energy
       | cost to $100 a month? (Granted, the author didn't include power
       | for the computer) If we buy our own system to run, why would we
       | need to "scale your own hardware"?
       | 
       | I get that this is just to give us an idea of what running
       | something yourself would cost when comparing with services like
       | ChatGPT, but if so, we wouldn't be making most of the choices
       | made here such as getting four NVIDIA Tesla T4 cards.
       | 
       | Memory is cheap, so running Llama-3 entirely on CPU is also an
       | option. It's slower, of course, but it's infinitely more
       | flexible. If I really wanted to spend a lot of time tinkering
       | with LLMs, I'd definitely do this to figure out what I want to
       | run before deciding on GPU hardware, then I'd get GPU hardware
       | that best matches that, instead of the other way around.
        
         | williamstein wrote:
         | > Self hosting means hosting it yourself, not running it on
         | Amazon.
         | 
         | No. I googled "self hosting", read the first few definitions,
         | and they agree with the article, not you. E.g., wikipedia --
         | https://en.wikipedia.org/wiki/Self-hosting_(web_services)
        
           | johnklos wrote:
           | The very first definition from the link you provide is:
           | 
           | > Self-hosting is the practice of running and maintaining a
           | website or service using a private web server, instead of
           | using a service outside of someone's own control.
           | 
           | Hosting anything on Amazon is not "using a private web
           | server" and is the very definition of using "a service
           | outside of someone's own control".
           | 
           | The fact that the rest of the article talks about "enabled
           | users to run their own servers on remote hardware or virtual
           | machines" is just wrong. It's not "their own servers", and we
           | don't have "more control over their data, privacy" when it's
           | literally in the possession of others.
        
             | Majestic121 wrote:
             | The second sentence is however :
             | 
             | > The practice of self-hosting web services became more
             | feasible with the development of cloud computing and
             | virtualization technologies, which enabled users to run
             | their own servers on remote hardware or virtual machines.
             | The first public cloud service, Amazon Web Services (AWS),
             | was launched in 2006, offering Simple Storage Service (S3)
             | and Elastic Compute Cloud (EC2) as its initial products.[3]
             | 
             | The mystery deepens
        
               | chasd00 wrote:
               | I hate when terms get diluted like this. "self hosted",
                | to me, means you own the physical machine. This reminds
                | me of how "air-gapped server" now means a route
               | configuration vs an actual gap of air, no physical
               | connection, between two networks. It really confuses
               | things.
        
           | carom wrote:
           | I would say that is "cloud hosted", which is obviously very
           | expensive compared to running on hardware you own (assuming
           | you own a computer and a GPU). That was the comparison I was
           | interested in, the fact that renting a computer is more
           | expensive than the OpenAI API is not a surprising result.
        
       | mark_l_watson wrote:
       | Until January this year I mostly used Google Colab for both LLMs
        | and deep learning projects. In January I spent about $1800 on an
        | Apple Silicon M2 Pro with 32G. When I first got it, I was only
       | so-so happy with the models I could run. Now I am ecstatically
       | happy with the quality of the models I can run on this hardware.
       | 
       | I sometimes use Groq Llama3 APIs (so fast!) or OpenAI APIs, but I
       | mostly use my 32G M2 system.
       | 
        | The article calculates the cost of self-hosting, but I think it
        | is also worth taking into account how happy I am self-hosting on
        | my own hardware.
        
       | rfw300 wrote:
       | I agree with most of the criticisms here, and will add on one
       | more: while it is generally true that you can't beat "serverless"
       | inference pricing for LLMs, production deployments often depend
       | on fine-tuned models, for which these providers typically charge
       | much more to host. That's where the cost (and security, etc.)
       | advantage for running on dedicated hardware comes in.
        
       | cloudking wrote:
       | What do you use it for? What problems does it solve?
        
       | k__ wrote:
       | Half-OT: can I shard Llama3 and run it on multiple wasm
       | processes?
        
       | Havoc wrote:
       | >initial server cost of $3,800
       | 
       | Not following?
       | 
       | Llama 8B is like 17ish gigs. You can throw that onto a single
       | 3090 off ebay. 700 for the card and another 500 for some 2nd hand
       | basic gaming rig.
       | 
       | Plus you don't need a 4 slot PCIE mobo. Plus it's a gen4 pcie
       | card (vs gen3). Plus skipping the complexity of multi-GPU. And
       | wouldn't be surprised if it ends up faster too (everything in one
       | GPU tends to be much faster in my experience, plus 3090 is just
       | organically faster 1:1)
       | 
       | Or if you're feeling extra spicy you can do same on a 7900XTX
       | (inference works fine on those & it's likely that there will be
       | big optimisation gains in next months).
        
         | Sohcahtoa82 wrote:
         | > Llama 8B is like 17ish gigs. You can throw that onto a single
         | 3090 off ebay
         | 
         | Someone correct me if I'm wrong, but I've always thought you
         | needed enough VRAM to have at least double the model size so
         | that the GPU has enough VRAM for the calculated values from the
         | model. So that 17 GB model requires 34 GB of RAM.
         | 
         | Though you can quantize to fp8/int8 with surprisingly little
         | negative effect and then run that 17 GB model with 17 GB of
         | VRAM.
        
           | jokethrowaway wrote:
           | No, you don't need that much
           | 
           | Here is a calculator (if you have a GPU you want to use EXL2,
           | otherwise GGUF) https://huggingface.co/spaces/NyxKrage/LLM-
           | Model-VRAM-Calcul...
           | 
           | Also model quantisation goes a long way with surprisingly
           | little loss in quality.
        
       | sgt101 wrote:
       | Running 13b code llama on my m1 macbook pro as I type this...
        
       | badgersnake wrote:
       | I've used llama3 on my work laptop with ollama. It wrote an
       | amazing pop song about k-nearest neighbours in the style of PJ
       | and Duncan's 'Let's Get Ready to Rhumble' called 'Let's Get Ready
       | to Classify'. For everything else it's next to useless.
        
       | forrest2 wrote:
       | A single synchronous request is not a good way to understand
       | cost here unless your workload really is one tiny request at a
       | time. ChatGPT handles many requests in parallel, and this
       | article's 4-GPU setup certainly can handle more too.
       | 
       | It is miraculous that the cost comparison isn't worse given how
       | adversarial this test is.
       | 
       | Larger requests, concurrent requests, and request queueing will
       | drastically reduce cost here.
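       | 
       | As a rough illustration (assuming a vLLM-style server with
       | continuous batching behind an OpenAI-compatible endpoint; the
       | URL and model name are placeholders):
       | 
       |   # Fire many requests concurrently; the server batches them,
       |   # so aggregate tokens/s is far higher than 1-at-a-time.
       |   import asyncio
       |   from openai import AsyncOpenAI
       | 
       |   client = AsyncOpenAI(base_url="http://localhost:8000/v1",
       |                        api_key="none")
       | 
       |   async def one(prompt):
       |       r = await client.chat.completions.create(
       |           model="meta-llama/Meta-Llama-3-8B-Instruct",
       |           messages=[{"role": "user", "content": prompt}],
       |           max_tokens=128)
       |       return r.choices[0].message.content
       | 
       |   async def main():
       |       prompts = [f"Summarize doc {i}" for i in range(64)]
       |       results = await asyncio.gather(*(one(p) for p in prompts))
       |       print(len(results), "responses")
       | 
       |   asyncio.run(main())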
        
       | axegon_ wrote:
       | Up until not too long ago I assumed that self-hosting an LLM
       | would come at an outrageous cost. I have a bunch of problems with
       | LLMs in general. The major one is that all LLMs (even OpenAI's)
       | will produce output that gives you a great sense of confidence,
       | only for you to be slapped across the face by harsh reality
       | later: for anything involving serious reasoning, chances are the
       | response you got was largely bullshit. The second is that I do
       | not entirely trust those companies with my data, be it OpenAI,
       | Microsoft, GitHub or any other.
       | 
       | That said, a while ago there was this[1] thread on here which
       | helped me snatch a brand new, unboxed p40 for peanuts. Really,
       | the cost was 2 or 3 jars of good quality peanut butter. Sadly
       | it's still collecting dust since although my workstation can
       | accommodate it, cooling is a bit of an issue - I 3D printed a
       | bunch of hacky vents but I haven't had the time to put it all
       | together.
       | 
       | The reason I went down this road was phi-3, which blew me away by
       | how powerful yet compact it is. Again, I would not trust it with
       | anything big, but I have been using it to sift through piles of
       | raw, unstructured text and extract data from it, and it has
       | honestly done wonders. Overall, depending on your budget and your
       | goal, running an LLM in your home lab is a very appealing idea.
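       | 
       | The kind of extraction I mean, roughly (a sketch using a local
       | Ollama server; the endpoint, model tag and fields are just
       | assumptions):
       | 
       |   # Ask phi-3 to pull fields out of messy text as strict JSON.
       |   import json, requests
       | 
       |   PROMPT = ('Extract "company", "amount" and "date" from the '
       |             'following text as JSON.\n\n{text}')
       | 
       |   def extract(text):
       |       r = requests.post(
       |           "http://localhost:11434/api/generate",
       |           json={"model": "phi3",
       |                 "prompt": PROMPT.format(text=text),
       |                 "format": "json",   # constrain output to JSON
       |                 "stream": False},
       |           timeout=120)
       |       return json.loads(r.json()["response"])
       | 
       |   print(extract("Acme Corp invoiced $1,200 on 2024-05-03."))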
       | 
       | [1] https://news.ycombinator.com/item?id=39477848
        
       | yieldcrv wrote:
       | this is not what I consider self-hosting, but OK
       | 
       | I would like to compare the costs vs. hardware on-prem, so this
       | helps with one side of the equation.
        
       | agcat wrote:
       | This is a reasonable way to do the math. But honestly, how many
       | products actually run at 100% utilisation? I did some math a few
       | months ago, but mostly on the basis of active users: what the %
       | difference would be if you have 1k to 10k users/mo. You can run
       | this for as low as ~$0.3k/mo on serverless GPUs and ~$0.7k/mo on
       | EC2.
       | 
       | The pricing is outdated now.
       | 
       | Here is the piece: https://www.inferless.com/learn/unraveling-
       | gpu-inference-cos...
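       | 
       | The rough shape of that comparison (every number below is a
       | made-up assumption, just to show the arithmetic):
       | 
       |   req_per_month = 10_000 * 30   # 10k users, ~30 requests each
       |   gpu_s_per_req = 2.0           # avg inference time (seconds)
       |   sls_rate      = 0.0005        # $/GPU-second, serverless (assumed)
       |   ec2_rate      = 1.00          # $/GPU-hour, on-demand (assumed)
       | 
       |   serverless = req_per_month * gpu_s_per_req * sls_rate
       |   dedicated  = 24 * 30 * ec2_rate  # billed whether busy or idle
       | 
       |   print(f"serverless ${serverless:.0f}/mo, EC2 ${dedicated:.0f}/mo")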
        
       | wesleyyue wrote:
       | Surprised no comments are pointing out that the analysis is
       | pretty far off simply because the author runs with a batch size
       | of 1. The cost being 100x-1000x what API providers
       | are charging should be a hint that something is seriously off,
       | even if you expect some of these APIs to be subsidized.
        
       | segmondy wrote:
       | I own an 8-GPU cluster that I built for super cheap, < $4,000:
       | 180GB of VRAM (7x 24GB plus 1x 12GB). There are tons of models I
       | run that aren't hosted by any provider; the only way to run them
       | is to host them myself. Furthermore, the author reports 39 tokens
       | in 6 seconds. For Llama-3 8B, I get almost 80 tk/s, and with
       | parallel requests I can easily get up to 800 tk/s. Most users at
       | home run one inference at a time because they are doing chat or
       | role play. If you are doing more serious work, you will most
       | likely have multiple inferences running at once. When working
       | with smaller models, it's not unusual to have 4-5 models loaded
       | at once with multiple inferences going. I have about 2TB of
       | models downloaded, I don't have to shuffle data back and forth to
       | the cloud, etc. To each their own; the argument for hosting in
       | the cloud is made by many today. Yet if you are not flush with
       | cash and a little creative, it's far cheaper to run your own
       | server than in the cloud.
       | 
       | To run Llama-3 8B, a new $300 3060 12GB will do; it will load
       | fine as a Q8 GGUF. If you must load it in fp16 and cash is a
       | problem, a $160 P40 will do. If performance is desired, a used
       | 3090 for ~$650 will do.
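       | 
       | For reference, running the Q8 GGUF that way is just a few lines
       | with llama-cpp-python (a sketch; the model path is a
       | placeholder):
       | 
       |   from llama_cpp import Llama
       | 
       |   llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q8_0.gguf",
       |               n_gpu_layers=-1,   # offload all layers to the GPU
       |               n_ctx=8192)
       |   out = llm.create_chat_completion(
       |       messages=[{"role": "user", "content": "Hello"}],
       |       max_tokens=64)
       |   print(out["choices"][0]["message"]["content"])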
        
         | kennethwolters wrote:
         | I am looking into renting a Hetzner GEX44 dedicated server to
         | run a couple of models on with Ollama. I haven't done the
         | arithmetic yet, but I wouldn't be surprised to see a 100x cost
         | decrease compared to the OpenAI APIs (granted, the models I'll
         | run on the GEX44 machine will be less powerful).
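         | 
         | The arithmetic is pretty simple, though (illustrative numbers
         | only; plug in a measured throughput):
         | 
         |   monthly_cost = 200.0   # EUR/month for the box (assumed)
         |   tok_per_s    = 300.0   # assumed batched throughput
         |   utilisation  = 0.5     # fraction of the month it's busy
         | 
         |   tok_per_month = tok_per_s * 3600 * 24 * 30 * utilisation
         |   eur_per_m = monthly_cost / (tok_per_month / 1e6)
         |   print(f"~EUR {eur_per_m:.2f} per 1M tokens")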
        
         | ekkyv6 wrote:
         | What kind of setup were you able to do for so cheap? I'd love
         | to be able to do more locally. I have access to a single RTX
         | A5000 at work, but it is often not enough for what I'm wanting
         | to do, and I end up renting cloud GPU.
        
       | jokethrowaway wrote:
       | Yeah, or you can get a GPU server with 20GB of VRAM on Hetzner
       | for ~200 EUR per month. Runpod and DigitalOcean are also quite
       | competitive on prices if you need a different GPU.
       | 
       | AWS is stupidly expensive.
        
         | hereonout2 wrote:
         | Expensive in general but combine some decent tooling and spot
         | instances and it can be insanely cheap.
         | 
         | The latest Nvidia L4 (24GB) GPU instances are currently less
         | than 15c/hr spot.
         | 
         | T4s are around 20c per hour spot, though they are smaller and
         | slower.
         | 
         | I've been provisioning hundreds of these at a time to do large
         | batch jobs at a fraction of the price of commercial solutions
         | (i.e. 10-100x cheaper).
         | 
         | Any problem that fits in a smaller GPU and can be expressed as
         | a batch job using spot instances can be done very cheaply on
         | AWS.
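         | 
         | Grabbing one programmatically is not much code either (a
         | sketch; the AMI and instance type are placeholders for your
         | own setup, and spot capacity/prices vary by region):
         | 
         |   import boto3
         | 
         |   ec2 = boto3.client("ec2", region_name="us-east-1")
         |   resp = ec2.run_instances(
         |       ImageId="ami-0123456789abcdef0",  # e.g. a DL AMI
         |       InstanceType="g6.xlarge",         # 1x NVIDIA L4, 24 GB
         |       MinCount=1, MaxCount=1,
         |       InstanceMarketOptions={
         |           "MarketType": "spot",
         |           "SpotOptions": {
         |               "SpotInstanceType": "one-time",
         |               "InstanceInterruptionBehavior": "terminate"}})
         |   print(resp["Instances"][0]["InstanceId"])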
        
       | waldrews wrote:
       | Hetzner GPU servers at $200/month for an RTX 4000 with 20GB seem
       | competitive. Anyone have experience with what kind of token
       | throughput you could get with that?
        
       | cheptsov wrote:
       | With dstack you can either utilize multiple affordable cloud GPU
       | providers at once to get the cheapest GPU offer, or use your own
       | cluster of on-prem servers; dstack supports both. Disclaimer: I'm
       | a core contributor to dstack.
        
       | visarga wrote:
       | I just bought a $1099 MacBook Air M3; I get about 10 tokens/s
       | with a Q5 quant. It doesn't even get hot, and I can take it with
       | me on the plane. It's really easy to install Ollama.
        
       ___________________________________________________________________
       (page generated 2024-06-14 23:01 UTC)