[HN Gopher] Cost of self hosting Llama-3 8B-Instruct
___________________________________________________________________
Cost of self hosting Llama-3 8B-Instruct
Author : veryrealsid
Score : 192 points
Date : 2024-06-14 15:30 UTC (7 hours ago)
(HTM) web link (blog.lytix.co)
(TXT) w3m dump (blog.lytix.co)
| philipkglass wrote:
| _Instead of using AWS another approach involves self hosting the
| hardware as well. Even after factoring in energy, this does
| dramatically lower the price._
|
| _Assuming we want to mirror our setup in AWS, we'd need 4x
| NVidia Tesla T4s. You can buy them for about $700 on eBay.
|
| Add in $1,000 to setup the rest of the rig and you have a final
| price of around:
|
| $2,800 + $1,000 = $3,800_
|
| This whole exercise assumes that you're using the Llama 3 8b
| model. At full fp16 precision that will fit in one 3090 or 4090
| GPU (the int8 version will too, and run faster, with very little
| degradation.) Especially if you're willing to buy GPU hardware
| from eBay, that will cost significantly less.
|
| I have my home workstation with a 4090 exposed as a vLLM service
| to an AWS environment where I access it via reverse SSH tunnel.
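|
| A minimal sketch of that kind of setup (assuming vLLM's OpenAI-
| compatible server on its default port 8000; the user and
| hostname are placeholders):
|
|     # on the workstation: serve Llama 3 8B via vLLM's OpenAI-compatible API
|     python -m vllm.entrypoints.openai.api_server \
|         --model meta-llama/Meta-Llama-3-8B-Instruct --port 8000
|     # reverse tunnel: the AWS box reaches the model at localhost:8000
|     ssh -N -R 8000:localhost:8000 ubuntu@aws-bastion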
| causal wrote:
| Came here to say this. No way you need to spend more than $1500
| to run L3 8B at FP16. And you can get near-identical
| performance at Q8 for even less.
|
| I'm guessing actual break-even time is less than half that, so
| maybe 2 years.
| causal wrote:
| Furthermore, the AWS estimates are also really poorly done.
| Using EKS this way is really inefficient, and a better
| comparison would be AWS Bedrock Haiku which averages $0.75/M
| tokens: https://aws.amazon.com/bedrock/pricing/
|
| This whole post makes OpenAI look like a better deal than it
| actually is.
| mrinterweb wrote:
| I was getting that sense too. It would not be difficult to
| build a desktop machine with a 4090 for around $2500. I run
| Llama-3 8b on my 4090, and it runs well. Plus side is I can
| play games with the machine too :)
| shostack wrote:
| How is inference latency for coding use cases on a local 3090
| or 4090 compared to say, hitting the GPT-4o API?
| whereismyacc wrote:
| I assume the characteristics would be pretty different, since
| your local hardware can keep the context loaded in memory,
| unlike APIs which I'm guessing have to re-load it for each
| query/generation?
| christina97 wrote:
| If you integrate with existing tooling, it won't do this
| optimization. Unless of course you really go crazy with
| your setup.
| moffkalast wrote:
| Setting one launch flag on llama.cpp server hardly
| qualifies as going crazy with one's setup.
| kiratp wrote:
| Nvidia's EULA prevents you from using consumer gaming GPUs in
| datacenters, so 4xxx cards are a non-starter for any service
| use cases.
|
| EDIT: TOS -> EULA per comments below
| nubinetwork wrote:
| That never stopped the crypto farmers...
| byteknight wrote:
| They also weren't selling the usage of the cards.
| oneshtein wrote:
| Nvidia terms of what?
| codetrotter wrote:
| Parent commenter used the wrong word. It's the EULA that
| prevents it.
|
| Regardless, it is true that it is a problem.
|
| https://www.reddit.com/r/MachineLearning/comments/ikrk4u/d_
| c...
| J_Shelby_J wrote:
| What about on prem? Like, my small business needs an LLM. Can
| I put a 3090 in a box in a closet?
|
| What if I'm a business and I'm selling LLMs in a box for you
| to put on a private network?
|
| What constitutes a data center according to the ToS? Is it
| enforceable if you never agree to the ToS (buying through
| eBay?)
| kiratp wrote:
| By using the drivers you agree to their TOS. So yes, it
| applies even on your private network.
| swatcoder wrote:
| The customer limitation described in the EULA is exactly
| this:
|
| > No Datacenter Deployment. The SOFTWARE is not licensed
| for datacenter deployment, except that blockchain
| processing in a datacenter is permitted.
|
| - https://www.nvidia.com/content/DriverDownloads/licence.
| php?l...
|
| There's no further elaboration on what "datacenter" means
| here, and it's a fair argument to say that a closet with
| one consumer-GPU-enriched PC is not a "datacenter
| deployment". The odds that Nvidia would pursue a claim
| against an individual or small business who used it that
| way is infinitesimal.
|
| So both the ethical issue (it's a fair-if-debatable read
| of the clause) and the practical legal issue (Nvidia
| wouldn't bother to argue either way) seem to say one
| needn't worry about.
|
| The clause is there to deter at-scale commercial service
| providers from buying up the consumer card market.
| light_hue_1 wrote:
| Don't listen to this person. They have no idea what they're
| talking about.
|
| No one cares about this TOS provision. I know both startups
| and large businesses that violate it as well as industry
| datacenters and academic clusters. There are companies that
| explicitly sell you hardware to violate it. Heck, Nvidia
| will even give you a discount when you buy the hardware to
| violate it in large enough volume!
|
| You do you.
| wongarsu wrote:
| In a previous AI wave hosters like OVH and Hetzner
| started offering servers with GTX 1080 at prices other
| hosters with datacenter-grade GPUs couldn't possibly
| compete with - and VRAM wasn't as big of a deal back
| then. That's who this clause targets.
|
| If you don't rent our servers or VMs Nvidia doesn't care.
| They aren't Oracle.
| giancarlostoro wrote:
| It's not in a data center, it's in his home.
| badgersnake wrote:
| How would they even know?
| jtriangle wrote:
| There are no nvidia police, they literally cannot stop you
| from doing this.
| choppaface wrote:
| Yeah but this article is terrible. First it talks about naively
| copy-pasting code to get "a seeming 10x speed-up" and then
| "This ended up being incorrect way of calculating the tokens
| used."
|
| I would not bank on anything in this article. It might as well
| have been written by a tiny Llama model.
| czhu12 wrote:
| I do the same thing with Cloudflare Tunnels, managing the
| cloudflared process and the llama.cpp server with systemd
| on my home internet connection.
|
| Have a 13B running on a 3070 with 16 gpu layers and the rest
| running off CPU.
|
| Performs okay, but way cheaper than renting a GPU on the cloud.
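|
| Roughly, the two pieces look like this (model path, layer split,
| and tunnel name are placeholders; each command sits behind its
| own systemd unit with Restart=always, and older llama.cpp builds
| call the binary "server" instead of "llama-server"):
|
|     # llama.cpp server: offload 16 layers to the 3070, rest on CPU
|     ./llama-server -m models/13b.Q4_K_M.gguf --n-gpu-layers 16 --port 8080
|     # cloudflared: route a public hostname to the local server
|     cloudflared tunnel run my-llm-tunnel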
| logtrees wrote:
| Whoa, so you have code running in AWS making use of your local
| hardware via what is called a reverse SSH tunnel? I will have
| to look into how that works, that's pretty powerful if so. I
| have a mac mini that I use for builds and deploys via FTP/SFTP
| and was going to look into setting up "messaging" via that
| pipeline to access local hardware compute through file messages
| lol, but reverse SSH tunnel sounds like it'll be way better for
| directly calling executables rather than needing to parse
| messages from files first.
| brrrrrm wrote:
| I use my mac mini exactly as described by the parent post but
| using ollama as the server. Super easy setup and obv chatgpt
| can guide you through it
| logtrees wrote:
| Unfortunately my mac mini isn't beefy enough to run ollama,
| it's the base model m1 from a couple years ago lol. But
| it's very powerful for builds, deploys, and some
| computation via scripts. Now I'm curious to check out how
| much memory the newest ones support for potentially using
| ollama on it haha. Thanks!
| brrrrrm wrote:
| Mine is also an m1. Just use llama3, its 8b quantized by
| default
| logtrees wrote:
| I will try it out, curious to see how it will work with
| 8gb of memory haha. Thanks for the heads up!
| apnew wrote:
| Do you happen to have any handy guides/docs/references
| for absolute beginners to follow?
| paulmd wrote:
| Ollama is not as powerful as llama.cpp or raw pytorch,
| but it is almost zero effort to get started.
|
| brew install ollama; ollama serve
| ollama pull dolphin-llama3:8b-v2.9-q5_K_M
| ollama run dolphin-llama3:8b-v2.9-q5_K_M
|
| https://ollama.com/library/dolphin-llama3:8b-v2.9-q5_K_M
|
| (It may need to be Q4 or Q3 instead of Q5 depending on
| how the RAM shakes out. But the Q5_K_M quantization
| (k-quantization is the term) is generally the best
| balance of size vs performance vs intelligence if you can
| run it, followed by Q4_K_M. Running Q6, Q8, or fp16 is of
| course even better but you're nowhere near fitting that
| on 8gb.)
|
| https://old.reddit.com/r/LocalLLaMA/comments/1ba55rj/over
| vie...
|
| Dolphin-llama3 is generally more compliant and I'd
| recommend that over just the base model. It's been fine-
| tuned to filter out the dumb "sorry I can't do that"
| battle, and it turns out this also increases the quality
| of the results (by limiting the space you're generating,
| you also limit the quality of the results).
|
| https://erichartford.com/uncensored-models
|
| https://arxiv.org/abs/2308.13449
|
| Most of the time you will want to look for an "instruct"
| model, if it doesn't have the instruct suffix it'll
| normally be a "fill in the blank" model that finishes
| what it thinks is the pattern in the input, rather than
| generate a textual answer to a question. But ollama
| typically pulls the instruct models into their repos.
|
| (sometimes you will see this even with instruct models,
| especially if they're misconfigured. When llama3 non-
| dolphin first came out I played with it and I'd get
| answers that looked like stackoverflow format or quora
| format responses with ""scores"" etc, either as the full
| output or mixed in. Presumably a misconfigured model, or
| they pulled in a non-instruct model, or something.)
|
| Dolphin-mixtral:8x7b-v2.7 is where things get really
| interesting imo. I have 64gb and 32gb machines and so far
| the Q6 and q4-k_m are the best options for those
| machines. dolphin-llama3 is reasonable but dolphin-
| mixtral gives a richer, better response.
|
| I'm told there's better stuff available now, but not sure
| what a good choice would be for 64gb and 32gb if not
| mixtral.
|
| Also, just keep an eye on r/LocalLLaMA in general, that's
| where all the enthusiasts hang out.
| verdverm wrote:
| using Tailscale can make the networking setup much easier,
| really like their service for things like this (or curling
| another dev's local running server)
| sneak wrote:
| Look into Nebula (or Tailscale if you trust third parties). I
| have all my workstations and servers on a mesh network that
| appears as a single /24 that is end to end encrypted,
| mutually authenticated and works through/behind NAT. I can
| spawn a vhost on any server that reverse proxies an API to
| any port on any machine.
|
| It's been an absolute gamechanger.
| logtrees wrote:
| Whooooaaa that is mind-blowing. Thanks for sharing. <3
| elorant wrote:
| Is there any resource that goes into more detail about how
| to setup all this?
| sneak wrote:
| https://github.com/slackhq/nebula
|
| the docs are good. when creating the initial CA make
| absolutely sure you set the CA expiration to 10-30 years,
| the default is 1 which means your whole setup explodes in
| a year without warning.
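|
| For example, something like (the CA name is a placeholder;
| 175200h is 20 years):
|
|     nebula-cert ca -name "my-mesh" -duration 175200h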
| aborsy wrote:
| Why do you have to trust a third party?
|
| It's end to end encrypted, and with tail lock enabled,
| nodes can not be added without user's permission.
| 1oooqooq wrote:
| why either of these over plain wireguard if you're not
| provisioning accounts?
| sneak wrote:
| Wireguard doesn't do nat punching and is not mesh, it's
| p2p only.
|
| totally different use case.
| favflam wrote:
| You can also check if you have ipv6. I have tried both, but
| prefer directly connecting home.
| hehdhdjehehegwv wrote:
| I dropped $5k on an A6000 and I can run llama3:70b day and
| night for the price of my electricity bill.
|
| I've gone through hundreds of millions, maybe billions, of
| tokens in the past year.
|
| This article is just "cloud is expensive" 101. Nothing new.
| brcmthrowaway wrote:
| Hows your ROI?
| hehdhdjehehegwv wrote:
| Absolutely phenomenal.
| logicallee wrote:
| Super cool, thanks for sharing. Do you mind sharing what you
| used the hundreds of millions (or billions) of tokens on?
| hereonout2 wrote:
| I've worked professionally over the last 12 months hosting
| quite a few foundation models and fine tuned LLMs on our own
| hardware, aws + azure vms and also a variety of newer
| "inference serving" type services that are popping up
| everywhere.
|
| I don't do any work with the output, I'm just the MLOps guy
| (ahem, DevOps).
|
| You mention expense but on a purely financial basis I find
| any of these hosted solutions really hard to justify against
| GPT 3.5 turbo prices, including building your own rig. $5k +
| electricity is loads of 3.5 Turbo tokens.
|
| Of course none of the data scientists or researchers I work
| with want to use that though - it's not their job to host
| these things or worry about the costs.
| elorant wrote:
| Is this at 4-bit quantization? And how many tokens per second
| is the output?
| EvgeniyZh wrote:
| 1B of tokens for Gemini Flash (which is on par with
| llama3-70b in my experience or even better sometimes) with
| 2:1 input-output would cost ~600 bucks (ignoring the fact
| they offer 1M tokens a day for free now). Ignoring
| electricity you'd break even in >8 years. You can find
| llama3-70b for ~same prices if you're interested in the
| specific model.
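|
| Rough arithmetic, assuming Flash's mid-2024 list price of about
| $0.35/M input and $1.05/M output tokens: 667M input x $0.35 plus
| 333M output x $1.05 is roughly $230 + $350, so about $580 per
| billion tokens against ~$5k of hardware.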
| cootsnuck wrote:
| Yea, for any hobbyist, indie developer, etc. I think it'd be
| ridiculous to not first try running one of these smaller (but
| decently powerful) open source models on your own hardware at
| home.
|
| Ollama makes it dead simple just to try it out. I was
| pleasantly surprised by the tokens/sec I could get with Llama 3
| 8B on a 2021 M1 MBP. Now need to try on my gaming PC I never
| use. Would be super cool to just have a LLM server on my local
| network for me and the fam. Exciting times.
| speakspokespok wrote:
| Why did this only occur to me recently? You can selfhost a k8s
| cluster and expose the services using a $5 digital ocean
| droplet. The droplet and k8s services are point-to-point
| connected using tailscale. Performance is perfectly fine, keeps
| your skillset sharp, and you're self-hosting!
| Helithumper wrote:
| You can also just directly connect to containers using
| Tailscale if it's just for internal use. That is, having an
| internally addressable `https://container_name` on your
| tailnet per-container if you want. This way I can setup
| Immich for example and it's just on my tailnet at
| `https://immich` without the need for a reverse proxy, etc...
|
| https://tailscale.com/blog/docker-tailscale-guide
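|
| The basic pattern from that guide, roughly (auth key and images
| are placeholders; the real setup also wants a state volume and
| serve config, see the link):
|
|     # tailscale sidecar joins the tailnet as the machine "immich"
|     docker run -d --name ts-immich --hostname immich \
|       --cap-add NET_ADMIN --device /dev/net/tun \
|       -e TS_AUTHKEY=tskey-auth-XXXX tailscale/tailscale
|     # the app container shares the sidecar's network namespace
|     docker run -d --network container:ts-immich ghcr.io/immich-app/immich-server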
| SparkyMcUnicorn wrote:
| And you can use Tailscale Funnel to serve it publicly. No
| need to pay for a cloud instance.
|
| https://tailscale.com/kb/1223/funnel
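|
| In recent Tailscale versions that's roughly a one-liner (the
| port is a placeholder):
|
|     tailscale funnel 8080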
| liquidise wrote:
| Great mix of napkin math and proper analysis, but what strikes me
| most is how cheap LLM access is. For it being relatively bleeding
| edge, our splitting hairs over < $20/M tokens is remarkable in
| itself, and something tech people should be thrilled about.
| refulgentis wrote:
| Smacks of the "starving kids in Africa" fallacy, you could make
| the same argument that tech people should be thrilled for
| current thing being available at $X for X =
| $2/$20/$200/$2000...
| theogravity wrote:
| The energy costs in the Bay Area are double the reported 24c
| cost, so energy alone would be around $100-ish a month instead of
| $50-ish.
| pkaye wrote:
| Unless you are in Santa Clara with Silicon Valley Power rates.
|
| https://www.siliconvalleypower.com/residents/rates-and-fees
| veryrealsid wrote:
| Yeah agreed, some of the areas we have access to were 16c (PA)
| and up to 24c (NYC); we doubled that cost in the analysis
| because of things like this.
| angoragoats wrote:
| Except that the article assumes that the GPUs would be using
| their max TDP all the time, which is incorrect. GPUs will
| throttle down to 5-20w (depending on the specific GPU). So your
| actual power consumption is going to be much, much lower,
| unless you're literally using your LLM 24/7.
| causal wrote:
| No way you need $3,800 to run an 8B model. 3090 and a basic rig
| is enough.
|
| That being said, the difference between OpenAI and AWS cost ($1
| vs $17) is huge. Is OpenAI just operating at a massive loss?
|
| Edit: Turns out AWS is actually cheaper if you don't use the
| terrible setup in this article, see comments below.
| throwup238 wrote:
| AWS's pricing is just ridiculous. Their 1-year reserve pricing
| for an 8x H100 or A100 instance (p4/p5) costs just as much as
| buying the machine outright with tens of thousands left over
| for the NVIDIA enterprise license and someone to manage them
| (per instance!). Their on demand pricing is even more insane -
| they're charging $3.x/hr for six year old cards.
| readams wrote:
| What about the cost of the power and cooling to run the
| machine (a lot!), and the staff to keep it running?
| throwup238 wrote:
| That's why I said "and someone to manage them". The
| difference is in the tens of thousands of dollars _per
| instance_. The savings from even a dozen instances is
| enough to pay for someone to manage them full time, and
| that's just for the first year. In years 2 and 3 you're saving
| six figures per instance, so you'd be able to afford one
| person per machine to hand-massage them like some fancy
| Kobe beef.
|
| A100 TDP is 400W so assuming 4kW for the whole machine,
| that's a little more than $5k/year at $0.15/kWh. Again, the
| difference is in the tens of thousands _per instance_. Even
| at 50% utilization over three years, if you need more than
| a dozen machines it's much cheaper to buy them outright,
| especially on credit.
| refulgentis wrote:
| I mean, no, I came to scan the comments quick after reading
| because there's a lot of bad info you can walk away with from
| the post, it's sort of starting from scratch on hosting LLMs
|
| If you keep reading past there, they get it down significantly.
| The 8 tkn/s number AWS was evaluated on is really funny, that's
| about what you'd get on last year's iPhone, and it's not because
| Apple is special, it's because there's barely any reasonable
| optimization being done here. No batching, float32 weights (8
| bit is guaranteed indistinguishable from 32 bit, 5 bit tests as
| definitely indistinguishable in blind tests, 4 bit arguably is
| indistinguishable)
| causal wrote:
| You're right. In fact, using EKS at all is silly when AWS
| offers their Bedrock service with Claude Haiku (rated #19 on
| Chat Arena vs. ChatGPT3.5-Turbo at #37) for a much lower cost
| of $0.75/M tokens (averaging input and output like OP
| does)[0].
|
| So in reality AWS is cheaper for a much better model if you
| don't go with a wildly suboptimal setup.
|
| [0] https://aws.amazon.com/bedrock/pricing/
| throwaway240403 wrote:
| I thought it was generally known they were operating at a loss?
|
| Even with the subs and API charges, they still let people use
| ChatGPT for free with no monetization options. Sure they are
| collecting the data for training, but that's hard to quantify
| the value of.
| jezzarax wrote:
| llama.cpp + llama-3-8b in Q8 runs great on a single T4 machine.
| Cannot remember the TPS I got there, but it was much above the 6
| tokens/s mentioned in the article.
| veryrealsid wrote:
| Interesting, I got very different results depending on how I
| ran the model, will definitely give this a try!
|
| edit: Actually could you share how long it took to make a
| query? One of our issues is we need it to respond in a fast
| time frame
| jezzarax wrote:
| I checked some logs from my past experiments: prompt decoding
| ran at about 400 tokens/s over a ~3k-token query, so about 7
| seconds to process it, and then the generation speed was
| about 28 tokens/s.
| throwup238 wrote:
| The T4 is a six year old card. A much better comparison would be
| a 3090, 4090, A10, A100, etc.
| michaelmior wrote:
| There's also the option of platforms such as BentoML (I have no
| affiliation) that offer usage-based pricing so you can at least
| take the 100% utilization assumption off the table. I'm not sure
| how the price compares to EKS.
|
| https://www.bentoml.com/
| barbegal wrote:
| There's some dodgy maths
|
| >( 100 / 157,075,200 ) * 1,000,000 = $0.000000636637738
|
| Should be $0.64 per million tokens, so still expensive.
| jasonjmcghee wrote:
| Being 6 orders of magnitude off in your cost calculation isn't
| great.
|
| Groq costs about that for Llama 3 70B (which is a monumentally
| better model) and 1/10th of that for Llama 3 8B.
| pants2 wrote:
| Groq doesn't currently have a paid API that one can sign up
| for.
| jasonjmcghee wrote:
| Yup. True. Should say "will" - currently free but heavily
| rate-limited. Together AI looks to be about $0.30 / 1M
| tokens, as another price comparison. Which you can pay for.
| gradus_ad wrote:
| I wonder how long NVIDIA can justify its current market cap once
| people realize just how cheap it is to run inference on these
| models given that LLM performance is plateauing, LLM's as a whole
| are becoming commoditized, and compute demand for training will
| drop off a cliff sooner than people expect.
| nextworddev wrote:
| It's actually about training, not inference. You can't do
| training on commodity GPUs, but yeah, once someone figures that
| out, Nvidia could crash.
| gradus_ad wrote:
| I know, my point is that when training demand decreases
| people will realize that inference does not make up the
| difference
| nextworddev wrote:
| Yeah the big question I'm struggling with is exactly when
| training demand will fall if at all
| sroussey wrote:
| Every research lab is focused on new architectures that
| would reduce training costs.
| nextworddev wrote:
| Yeah, we essentially need Hadoop for LLM training.
| amluto wrote:
| Nvidia doesn't obviously have a strong inference play right
| now for a widely-deployed small model. For a model that
| really needs a 4090, maybe. But for a model that can run on a
| Coral chip or an M1/M2/M3 or whatever Intel or AMD's latest
| little AI engines can do? This market has plenty of players,
| and Nvidia doesn't seem to be anywhere near the lead except
| insofar as it's a little bit easier to run the software on
| CUDA.
| smokel wrote:
| As someone else points out, training is slightly more involved,
| but I also find that these smaller models are next to worthless
| compared to the larger ones.
|
| There are probably some situations where it suffices to use a
| small model, but for most purposes, I'd prefer to use the state
| of the art, and I'm eager for that state to progress a little
| more.
| dwaltrip wrote:
| > LLM performance is plateauing
|
| It's a wee bit early to call this. Let's see what the top labs
| release in the next year or two, yeah?
|
| GPT-4 was released only 15 months ago, which was about 3 years
| after GPT-3 was released.
|
| These things don't happen overnight, and many multi-year
| efforts are currently in the works, especially starting last
| year.
| epolanski wrote:
| I partially believe that the real race for many tech
| players is actually AGI, and later ASI, and until that problem
| is solved the hardware arms race will keep being part of it.
|
| Not only is big tech part of it, but billion-dollar startups
| are popping up everywhere from China to the US and the Middle
| East.
| chrisdbanks wrote:
| Unless we hit another AI winter. We might get to the point
| where the hardware just can't give better returns and have to
| wait another 20 years for the next leap forward. We're still
| orders of magnitude away from the human brain.
| riku_iki wrote:
| > I wonder how long NVIDIA can justify its current market cap
| once people realize just how cheap it is to run inference on
| these models given that LLM performance is plateauing
|
| The next wave driving demand could be actual new products built
| on LLMs. There are very few use cases currently well developed
| besides chatbots, but the potential is very large.
| throwaway2016a wrote:
| Llama-3 is one of the models provided by AWS Bedrock which offers
| pay as you go pricing. I'm curious how it would break down on
| that.
|
| LLAMA 8B on Bedrock is $0.40 per 1M input tokens and $0.60 per 1M
| output tokens which is a lot cheaper than OpenAI models.
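|
| For reference, invoking it is a single API call; roughly, with
| the AWS CLI (model id and body format from memory, so treat as
| a sketch):
|
|     aws bedrock-runtime invoke-model \
|       --model-id meta.llama3-8b-instruct-v1:0 \
|       --body '{"prompt":"Explain KV caching in one paragraph.","max_gen_len":256}' \
|       --cli-binary-format raw-in-base64-out out.json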
|
| Edit: to add to that, as technical people we tend to discount the
| value of our own time. Bedrock and the OpenAI API are both very
| easy to integrate with and get started on. How long did this
| server take
| to build? How much time does it take to maintain and make sure
| all the security patches are applied each month? How often does
| it crash and how much time will be needed to recover it? Do you
| keep spare parts on hand / how much is the cost of downtime if
| you have to wait to get a replacement part in the mail? That's
| got to be part of the break-even equation.
| croddin wrote:
| Groq also has pay-as-you-go pricing for Llama 3 8B at only
| $0.05/$0.08 per million tokens, and it is very fast.
| sergiotapia wrote:
| Groq is actually allowing you to pay now and get real
| service?
| coder543 wrote:
| The option to pay is still listed as coming soon, but I
| also see pricing information in the settings page, so maybe
| it actually is coming somewhat sooner. I'm seeing $0.05/1M
| input and $0.10/1M output for llama3 8B, which is not
| exactly identical to what the previous person quoted.
|
| Either way, I wish Groq _would_ offer a real service to
| people willing to pay.
| croddin wrote:
| I found the .05/.08 here: https://wow.groq.com/
| refulgentis wrote:
| tl;dr: no-ish, it's getting better but still not there.
|
| I don't really get it, only thing I can surmise is it'd be
| such a no-brainer in various cases, that if they tried
| supporting it as a service, they'd have to cut users. I've
| seen multiple big media company employees begging for some
| sort of response on their discord.
| localfirst wrote:
| didn't know they finally turned on pricing plans
| VagabundoP wrote:
| Just to bounce off this a little: if you are looking to fine-
| tune using an on-demand service, it seems Amazon SageMaker can
| do it at decent prices:
|
| https://aws.amazon.com/sagemaker/pricing/
|
| I'd love to hear someone's experience using this as I want to
| make an RPG rules bot tied to a specific ruleset as a project
| but I fear AWS as it might bankrupt me!
| zsyllepsis wrote:
| In my experience SageMaker was relatively straightforward for
| fine-tuning models that could fit on a single instance, but
| distributed training still requires a good bit of detailed
| understanding of how things work under the covers. SageMaker
| Jumpstart includes some pretty easy out-of-the-box
| configurations for fine-tuning models that are a good
| starting point. They will incorporate some basic quantization
| and other cost-savings techniques to help reduce the total
| compute time.
|
| To help control costs, you can choose pretty conservative
| settings in terms of how long you want to let the model train
| for. Once that iteration is done and you have a model
| artifact saved, you can always pick back up and perform more
| rounds of training using the previous checkpoint as a
| starting point.
| veryrealsid wrote:
| > How long did this server take to build?
|
| About 3 days [from 0 and iterating multiple times to the final
| solution]
|
| > How much time does it take to maintain and make sure all the
| security patches are applied each month?
|
| A lot
|
| > How often does it crash and how much time will be needed to
| recover it? Do you keep spare parts on hand / how much is the
| cost of downtime if you have to wait to get a replacement part
| in the mail?
|
| All really good points, the exercise to self host is really
| just to see what is possible but completely agree that self
| hosting makes little to no sense unless you have a business
| case that can justify it.
|
| Not to mention that signing customers with SLAs and then ending
| up with downtime would put even more pressure on your self-
| hosted hardware.
| johnklos wrote:
| > How long did this server take to build? How much time does it
| take to maintain and make sure all the security patches are
| applied each month? How often does it crash and how much time
| will be needed to recover it? Do you keep spare parts on hand /
| how much is the cost of downtime if you have to wait to get a
| replacement part in the mail? That's got to be part of the
| break-even equation.
|
| All of these are the kinds of things that people say to non-
| technical people to try to sell cloud. It's all fluff.
|
| Do you _really_ think that cloud computing doesn't have
| security issues, or crashes, or data loss, or that it doesn't
| involve lots of administration? Thinking that we don't know any
| better is both disingenuous and a bit insulting.
| websap wrote:
| I've managed fleets on cloud providers with over 100k
| instances, even with all the excellent features through APIs,
| managing instances can quickly get tricky.
|
| Tbh, your comment is kind of insulting and belittles how far
| we've come ahead in infrastructure management.
|
| The cloud is probably more secure than a set of janky servers
| that you have running in your basement. You can totally
| automate away 0-days, CVEs, and get access to better security
| primitives.
| johnklos wrote:
| If my comment is insulting, I apologize. That was not my
| intention. My intention was to say that writing sales speak
| in a technical discussion is insulting to those of us who
| know better.
|
| However, you've now gone out of your way to try to be
| insulting. You know nothing about me, yet you want to
| suggest that the cloud is more secure than my servers, and
| that my servers are "janky"?
|
| Please try a little harder to engage in reasonable
| discourse.
| yjftsjthsd-h wrote:
| > The cloud is probably more secure than a set of janky
| servers that you have running in your basement.
|
| Apples/Oranges. Your janky cloud[0] is less secure than the
| servers in my basement, because I'm a mostly competent
| sysadmin. Cloud lets you trade _some_ operational concerns
| for higher costs, but not all of them.
|
| [0] If you can assume servers run by somebody who doesn't
| know how to do it properly, obviously I can assume the same
| about cloud configuration. Have fun with your leaked API
| keys.
| throwaway2016a wrote:
| I've managed both data centers and cloud and IMHO, no, it is
| not fluff. To take it in order:
|
| > doesn't have security issues
|
| It sure does, but the matrix of responsibility is very
| different when it is a hosted service. Note: I am making
| these comments about Bedrock, which is serverless, not in
| relation to EC2.
|
| > It crashes
|
| Absolutely, but the recovery profile is not even close to the
| same. Unless you have a full time person with physical access
| to your server who can go press buttons.
|
| > data loss
|
| I'm going to shift this one a tiny bit. What about hardware
| loss? You need backups regardless. On the cloud when a HDD
| dies you provision a new one. On premise you need to have the
| replacement there and ready to swap out (unless you want to
| wait for shipping). Same with all the other components. So
| you basically need to buy two of everything. If you have a
| fleet of servers that's not too bad since presumably they
| aren't going to all fail on the same component at the same
| time. But for a single server it is literally double the
| cost.
|
| > doesn't involve lots of administration
|
| Again, this is in relation to Bedrock, which is a managed
| serverless environment. So there is literally no
| administration aside from provisioning and securing access to
| the resource. You'd have a point if this was running on EC2
| or EKS but that's not what my post was about.
|
| > Thinking that we don't know any better is both disingenuous
| and a bit insulting.
|
| I'm not saying cloud is perfect in any way; like all things,
| it requires tradeoffs. But quite frankly, I find your dismissing
| my 25 years of experience, a third of which has been in
| real data centers (including a top-50 internet company at the
| time), as "fluff" to be "disingenuous and a bit insulting".
| amluto wrote:
| > Unless you have a full time person with physical access
| to your server who can go press buttons.
|
| Every colo facility I've used offers "remote hands". If you
| need a button pressed or a disk swapped, they will do it,
| with a fee structure and response time that varies
| depending on one's arrangement with the operator. But it's
| generally both inexpensive and fast.
|
| > What about hardware loss? You need backups regardless. On
| the cloud when a HDD dies you provision a new one. On
| premise you need to have the replacement there and ready to
| swap out (unless you want to wait for shipping).
|
| Two of everything may still be cheaper than expensive cloud
| services. But there's an obvious middle ground: a service
| contract that guarantees you spare parts and a technician
| with a designated amount of notice. This service is widely
| available and reasonably priced. (Don't believe the listed
| total prices on the web sites of big name server vendors --
| they negotiate substantial discounts, even in small
| quantities.)
| throwaway2016a wrote:
| > But it's generally both inexpensive and fast.
|
| I guess inexpensive is relative. I've been on cloud for a
| while so I'm not sure what the going rates are for
| "remote hands" and most of my experience is with on-
| premise vs co-lo.
|
| > Two of everything may still be cheaper than expensive
| cloud services.
|
| That is true. Everything has tradeoffs. Though in the OPs
| case I think the math is relatively clear. With Open AIs
| pricing he calculated the break even at 5 years just for
| the hardware and electricity. Assuming that calculation
| is right, two of everything would up that to 7+ years, at
| which point... a lot can happen in 7 years.
| amluto wrote:
| > I guess inexpensive is relative. I've been on cloud for
| a while so I'm not sure what the going rates are for
| "remote hands" and most of my experience is with on-
| premise vs co-lo.
|
| At a low end facility, I've usually paid between $0 and
| $50 per remote hands incident. The staff was friendly and
| competent, and I had no complaints. The price list goes a
| bit higher, but I haven't needed those services at that
| facility.
| yolovoe wrote:
| You could have gotten rid of the middle paragraph. It's not
| fluff. These are valid technical points. Issues most
| companies would rather (reasonably) pay to not have to deal
| with.
|
| And do you really think you can offer better security and
| uptime than AWS? Not impossible but very expensive if you're
| managing everything from your own hardware. You clearly
| vastly underestimate all that AWS is taking care of.
| AaronFriel wrote:
| These costs don't line up with my own experiments using vLLM on
| EKS for hosting small to medium sized models. For small (under
| 10B parameters) models on g5 instances, with prefix caching and
| an agent style workload with only 1 or a small number of turns
| per request, I saw on the order of tens of thousands of
| tokens/second of prefill (due to my common system prompts) and
| around 900 tokens/second of output.
|
| I think this worked out to around $1/million tokens of output and
| orders of magnitude less for input tokens, before reserved
| instances or other providers were considered.
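|
| For anyone trying to reproduce something like this, the relevant
| vLLM knobs are roughly the following (flag names as of mid-2024;
| values are illustrative):
|
|     python -m vllm.entrypoints.openai.api_server \
|       --model meta-llama/Meta-Llama-3-8B-Instruct \
|       --enable-prefix-caching --max-num-seqs 64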
| veryrealsid wrote:
| Interesting, I think how the model runs makes a big difference
| and I plan to re-run this experiment with different models and
| different ways of running the model.
| winddude wrote:
| Does AWS not have instances with fewer vCPUs and less memory
| but multiple T4s? Because with 192 GB of memory and 24 cores,
| you're paying for a ton of resources you won't be using if
| you're only running inference.
| kiratp wrote:
| 3 year commit pricing with Jetstream + Maxtext on TPU v5e is
| $0.25 per million tokens.
|
| On demand pricing put it at about $0.45 per million tokens.
|
| Source: We use TPUs at scale at https://osmos.io
|
| Google Next 2024 session going into detail:
| https://www.youtube.com/watch?v=5QsM1K9ahtw
|
| https://github.com/google/JetStream
|
| https://github.com/google/maxtext
| qihqi wrote:
| For pytorch users: checkout the sister project:
| https://github.com/google/jetstream-pytorch/blob/main/benchm...
| yousif_123123 wrote:
| deepinfra.com hosts Llama 3 8b for 8 cents per 1m tokens. I'm not
| sure it's the cheapest but it's pretty cheap. There may be even
| cheaper options.
|
| (Haven't used it in production, thinking to use it for side
| projects).
| xmonkee wrote:
| Does anyone know the impact of the prompt size in terms of
| throughput? If I'm only generating 10 tokens, does it matter if
| my initial prompt is 10 tokens or 8000 tokens? How much does it
| matter?
| vinni2 wrote:
| Ggml Q8 models on ollama can run on much cheaper hardware without
| losing much performance.
| ilaksh wrote:
| Kind of a ridiculous approach, especially for this model. Use
| together.ai, fireworks.ai, RunPod serverless, any serverless. Or
| use ollama with the default quantization, will work on many home
| computers, including my gaming laptop which is about 5 years old.
| angoragoats wrote:
| Agreed with the sentiments here that this article gets a lot of
| the facts wrong, and I'll add one: the cost for electricity when
| self-hosting is dramatically lower than the article says. The
| math assumes that each of the Tesla T4s will be using their full
| TDP (70W each) 24 hours a day, 7 days a week. In reality, GPUs
| throttle down to a low power state when not in use. So unless
| you're conversing with your LLM literally 24 hours a day, it will
| be using dramatically less power. Even when actively doing
| inference, my GPU doesn't quite max out its power usage.
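|
| To put rough numbers on it: 4x 70W at full TDP is 280W, about
| 200 kWh a month, or roughly $48 at $0.24/kWh; at a ~20W idle
| draw per card that drops to 80W, under 60 kWh, or roughly $14 a
| month, plus whatever the rest of the box draws.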
|
| Your self-hosted LLM box is going to use maybe 20-30% of the
| power this article suggests it will.
|
| Source: I run LLMs at home on a machine I built myself.
| baobabKoodaa wrote:
| If we care about cost efficiency when running LLMs, the most
| important things are:
|
| 1. Don't use AWS, because it's one of the most expensive cloud
| providers
|
| 2. Use quantized models, because they offer the best output
| quality per money spent, regardless of the budget
|
| This article, on the other hand, focuses exclusively on running
| an unquantized model on AWS...
| johnklos wrote:
| Self hosting means hosting it yourself, not running it on Amazon.
| I think the distinction the author intends to make is between
| running something that can't be hosted elsewhere, like ChatGPT,
| versus running Llama-3 yourself.
|
| Overlooking that, the rest of the article feels a bit strange.
| Would we really have a use case where we can make use of those
| 157 million tokens a month? Would we really round $50 of energy
| cost to $100 a month? (Granted, the author didn't include power
| for the computer) If we buy our own system to run, why would we
| need to "scale your own hardware"?
|
| I get that this is just to give us an idea of what running
| something yourself would cost when comparing with services like
| ChatGPT, but if so, we wouldn't be making most of the choices
| made here such as getting four NVIDIA Tesla T4 cards.
|
| Memory is cheap, so running Llama-3 entirely on CPU is also an
| option. It's slower, of course, but it's infinitely more
| flexible. If I really wanted to spend a lot of time tinkering
| with LLMs, I'd definitely do this to figure out what I want to
| run before deciding on GPU hardware, then I'd get GPU hardware
| that best matches that, instead of the other way around.
| williamstein wrote:
| > Self hosting means hosting it yourself, not running it on
| Amazon.
|
| No. I googled "self hosting", read the first few definitions,
| and they agree with the article, not you. E.g., wikipedia --
| https://en.wikipedia.org/wiki/Self-hosting_(web_services)
| johnklos wrote:
| The very first definition from the link you provide is:
|
| > Self-hosting is the practice of running and maintaining a
| website or service using a private web server, instead of
| using a service outside of someone's own control.
|
| Hosting anything on Amazon is not "using a private web
| server" and is the very definition of using "a service
| outside of someone's own control".
|
| The fact that the rest of the article talks about "enabled
| users to run their own servers on remote hardware or virtual
| machines" is just wrong. It's not "their own servers", and we
| don't have "more control over their data, privacy" when it's
| literally in the possession of others.
| Majestic121 wrote:
| The second sentence is however :
|
| > The practice of self-hosting web services became more
| feasible with the development of cloud computing and
| virtualization technologies, which enabled users to run
| their own servers on remote hardware or virtual machines.
| The first public cloud service, Amazon Web Services (AWS),
| was launched in 2006, offering Simple Storage Service (S3)
| and Elastic Compute Cloud (EC2) as its initial products.[3]
|
| The mystery deepens
| chasd00 wrote:
| I hate when terms get diluted like this. "self hosted",
| to me, means you own the physical machine. This reminds me
| of how "air-gapped server" now means a route
| configuration vs an actual gap of air, no physical
| connection, between two networks. It really confuses
| things.
| carom wrote:
| I would say that is "cloud hosted", which is obviously very
| expensive compared to running on hardware you own (assuming
| you own a computer and a GPU). That was the comparison I was
| interested in, the fact that renting a computer is more
| expensive than the OpenAI API is not a surprising result.
| mark_l_watson wrote:
| Until January this year I mostly used Google Colab for both LLMs
| and deep learning projects. In January I spent about $1800 on
| an Apple Silicon M2 Pro with 32 GB. When I first got it, I was only
| so-so happy with the models I could run. Now I am ecstatically
| happy with the quality of the models I can run on this hardware.
|
| I sometimes use Groq Llama3 APIs (so fast!) or OpenAI APIs, but I
| mostly use my 32G M2 system.
|
| The article calculates the cost of self-hosting, but I think it
| is also worth taking into account how happy I am self-hosting on
| my own hardware.
| rfw300 wrote:
| I agree with most of the criticisms here, and will add on one
| more: while it is generally true that you can't beat "serverless"
| inference pricing for LLMs, production deployments often depend
| on fine-tuned models, for which these providers typically charge
| much more to host. That's where the cost (and security, etc.)
| advantage for running on dedicated hardware comes in.
| cloudking wrote:
| What do you use it for? What problems does it solve?
| k__ wrote:
| Half-OT: can I shard Llama3 and run it on multiple wasm
| processes?
| Havoc wrote:
| >initial server cost of $3,800
|
| Not following?
|
| Llama 8B is like 17ish gigs. You can throw that onto a single
| 3090 off ebay. 700 for the card and another 500 for some 2nd hand
| basic gaming rig.
|
| Plus you don't need a 4 slot PCIE mobo. Plus it's a gen4 pcie
| card (vs gen3). Plus skipping the complexity of multi-GPU. And
| wouldn't be surprised if it ends up faster too (everything in one
| GPU tends to be much faster in my experience, plus 3090 is just
| organically faster 1:1)
|
| Or if you're feeling extra spicy you can do same on a 7900XTX
| (inference works fine on those & it's likely that there will be
| big optimisation gains in next months).
| Sohcahtoa82 wrote:
| > Llama 8B is like 17ish gigs. You can throw that onto a single
| 3090 off ebay
|
| Someone correct me if I'm wrong, but I've always thought you
| needed enough VRAM to have at least double the model size so
| that the GPU has enough VRAM for the calculated values from the
| model. So that 17 GB model requires 34 GB of RAM.
|
| Though you can quantize to fp8/int8 with surprisingly little
| negative effect and then run that 17 GB model with 17 GB of
| VRAM.
| jokethrowaway wrote:
| No, you don't need that much
|
| Here is a calculator (if you have a GPU you want to use EXL2,
| otherwise GGUF) https://huggingface.co/spaces/NyxKrage/LLM-
| Model-VRAM-Calcul...
|
| Also model quantisation goes a long way with surprisingly
| little loss in quality.
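|
| As a rough worked example for Llama 3 8B: the fp16 weights are
| about 16 GB, and the KV cache (32 layers x 8 KV heads x 128 head
| dim x 2 tensors x 2 bytes) is roughly 128 KB per token, so
| around 1 GB at an 8k context, nowhere near double the model
| size.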
| sgt101 wrote:
| Running 13b code llama on my m1 macbook pro as I type this...
| badgersnake wrote:
| I've used llama3 on my work laptop with ollama. It wrote an
| amazing pop song about k-nearest neighbours in the style of PJ
| and Duncan's 'Let's Get Ready to Rhumble' called 'Let's Get Ready
| to Classify'. For everything else it's next to useless.
| forrest2 wrote:
| A single synchronous request is not a good way to understand cost
| here unless your workload is truly singular tiny requests.
| ChatGPT handles many requests in parallel, and this article's 4
| GPU setup certainly can handle more too.
|
| It is miraculous that the cost comparison isn't worse given how
| adversarial this test is.
|
| Larger requests, concurrent requests, and request queueing will
| drastically reduce cost here.
| axegon_ wrote:
| Up until not too long ago I assumed that self-hosting an llm
| would come at an outrageous cost. I have a bunch of problems with
| LLM's in general. The major one is that all LLMs(even openAI)
| will produce output which will give anyone a great sense of
| confidence, only to be later slapped across the face with the
| harsh reality-for anything involving serious reasoning, chances
| are the response you got was at large bullshit. The second one is
| that I do not entirely trust those companies with my data, be it
| OpenAI, Microsoft or Github or any other.
|
| That said, a while ago there was this[1] thread on here which
| helped me snatch a brand new, unboxed p40 for peanuts. Really,
| the cost was 2 or 3 jars of good quality peanut butter. Sadly
| it's still collecting dust since although my workstation can
| accommodate it, cooling is a bit of an issue - I 3D printed a
| bunch of hacky vents but I haven't had the time to put it all
| together.
|
| The reason why I went this road was phi-3, which blew me away by
| how powerful, yet compact it is. Again, I would not trust it with
| anything big, but I have been using it for sifting through a
| bunch of raw, unstructured text and extract data from it and it's
| honestly done wonders. Overall, depending on your budget and your
| goal, running an llm in your home lab is a very appealing idea.
|
| [1] https://news.ycombinator.com/item?id=39477848
| yieldcrv wrote:
| this is not what I consider self hosting but ok
|
| I would like to compare the costs vs hardware on prem, so this
| helps with one side of the equation
| agcat wrote:
| This is a good way to do the math. But honestly, how many
| products actually have 100% utilisation? I did some math a few
| months ago, mostly on the basis of active users: what would the
| % difference be if you have 1k to 10K users/mo? You can run
| this as low as $0.3K/mo on Serverless GPUs and $0.7K/mo on EC2.
|
| The pricing is outdated now.
|
| Here is the piece -https://www.inferless.com/learn/unraveling-
| gpu-inference-cos...
| wesleyyue wrote:
| Surprised no comments are pointing out that the analysis is
| pretty far off simply due to the fact that the author runs with
| batch size of 1. The cost being 100x - 1000x what API providers
| are charging should be a hint that something is seriously off,
| even if you expect some of these APIs to be subsidized.
| segmondy wrote:
| I own an 8 GPU cluster that I built for super cheap < $4,000.
| 180gb vram, 7x 24gb + 1x 24gb. There are tons of models that I
| run that aren't hosted by any provider. The only way to run them
| is to host them myself. Furthermore, the author gets 39 tokens
| in 6 seconds.
| For llama3-8b, I get almost 80 tk/s and if parallel, can easily
| get up to 800 tk/s. Most users at home infer only one at a time
| because they are doing chat or role play. If you are doing more
| serious work, you will most likely have multiple inference
| running at once. When working with smaller models, it's not
| unusual to have 4-5 models loaded at once with multiple inference
| going. I have about 2tb of models downloaded, I don't have to
| shuffle data back and forth to the cloud, etc. To each their own,
| the author's argument is made today by many on why you should
| host in the cloud. Yet if you are not flush with cash and a
| little creative, it's far cheaper to run your own server than in
| the cloud.
|
| To run llama-3 8b. A new $300 3060 12gb will do, it will load
| fine in Q8 gguf. If you must load in fp16 and cash is a problem a
| $160 P40 will do. If performance is desired a used 3090 for ~$650
| will do.
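|
| For the parallel case, llama.cpp's server exposes this directly;
| roughly (model path and slot count are illustrative):
|
|     ./llama-server -m models/llama-3-8b-instruct.Q8_0.gguf \
|       --n-gpu-layers 99 --parallel 8 --cont-batching --port 8080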
| kennethwolters wrote:
| I am looking into renting Hetzner GEX44 dedicated server to run
| a couple of models on with Ollama. I haven't done the arithmetic
| yet but I wouldn't be surprised to see a 100x cost-decrease
| compared to OpenAI APIs (granted the models I'll run on the
| GEX44 machines will be less powerful)
| ekkyv6 wrote:
| What kind of setup were you able to do for so cheap? I'd love
| to be able to do more locally. I have access to a single RTX
| A5000 at work, but it is often not enough for what I'm wanting
| to do, and I end up renting cloud GPU.
| jokethrowaway wrote:
| Yeah, or you can get a gpu server with 20GB VRAM on hetzner for
| ~200 EUR per month. Runpod and DigitalOcean are also quite
| competitive on prices if you need a different GPU.
|
| AWS is stupidly expensive.
| hereonout2 wrote:
| Expensive in general but combine some decent tooling and spot
| instances and it can be insanely cheap.
|
| Instances with the latest Nvidia L4 GPUs (24 GB) are currently
| less than 15c per hour spot.
|
| T4s are around 20c per hour spot, though they are smaller and
| slower.
|
| I've been provisioning hundreds of these at a time to do large
| batch jobs at a fraction of the price of commercial solutions
| (i.e. 10-100x cheaper).
|
| Any problem that fits in a smaller GPU and can be expressed as
| a batch job using spot instances can be done very cheaply on
| AWS.
| waldrews wrote:
| Hetzner GPU servers at $200/month for an RTX 4000 with 20GB seem
| competitive. Anyone have experience with what kind of token
| throughput you could get with that?
| cheptsov wrote:
| With dstack you can either utilize multiple affordable cloud GPU
| providers at once to get the cheapest GPU offer, or use your own
| cluster of on-prem servers; dstack supports both together.
| Disclaimer: I'm a core contributor to dstack.
| visarga wrote:
| I just bought a $1099 MacBook Air M3, I get about 10 tokens/s for
| a q5 quant. Doesn't even get hot, and I can take it with me on
| the plane. It's really easy to install ollama.
___________________________________________________________________
(page generated 2024-06-14 23:01 UTC)