[HN Gopher] JetMoE: Reaching LLaMA2 performance with 0.1M dollars
___________________________________________________________________
JetMoE: Reaching LLaMA2 performance with 0.1M dollars
Author : gyre007
Score : 136 points
Date : 2024-04-04 17:04 UTC (5 hours ago)
(HTM) web link (research.myshell.ai)
(TXT) w3m dump (research.myshell.ai)
| helloericsf wrote:
| X thread: https://x.com/qinzytech/status/1775916338822709755?s=20
| plufz wrote:
| You've been in tech for too long when 1 million USD is your
| smallest unit.
| noodlesUK wrote:
| I wonder why they decided to call it 0.1M USD rather than 100k
| USD. For many of us, a million dollars is a large amount of
| money, even for a business.
| mattlondon wrote:
| Same reason things are x.99 - the 0.1 decimal "feels" smaller
| than seeing 100,000 - "holy fuck!" Etc
| plufz wrote:
| I'm sure they had their reasons, but all I can see is the
| Simpsons meme with Mr Burns at the ATM saying "What's the
| smallest amount of money I can think of? A thousand dollars."
| ;)
| IshKebab wrote:
| It's to imply that it costs other people in the millions, but
| they did it for only 0.1 million, which is a small number of
| millions. Just a rhetorical trick.
| oceanplexian wrote:
| 100k isn't worth anywhere near what it used to be, due to
| inflation. It might get you a nice pickup truck or a kitchen
| remodel. If your business is doing research and can't spend
| that, then it's more of a hobby than a business.
| uptownfunk wrote:
| Well, it's interesting to think about how much has been
| invested into BigModel companies (Anthropic, Perplexity,
| OpenAI) when it's very rapidly becoming commoditized.
| ipsum2 wrote:
| I'm skeptical; I expect data contamination was the reason for the
| high benchmark scores.
| hiddencost wrote:
| Yeah. IBM especially has a history of fudging the numbers on
| reports like this. Research puts together reports which are
| aggressively p-hacked and ensembled and overfit, and then sales
| uses those reports to boondoggle clients into using IBM.
| throwitaway222 wrote:
| This stuff is just going to keep getting pushed down.
| lolinder wrote:
| > JetMoE-8B is trained with less than $0.1 million cost but
| outperforms LLaMA2-7B from Meta AI, who has multi-billion-dollar
| training resources. LLM training can be much cheaper than people
| generally thought.
|
| They want you to read this as "we spent $100k compared to Meta's
| spending billions", but that's not actually what this says. It
| says that they spent $100k and Meta _has the resources_ to spend
| billions if they wanted to.
|
| We don't know what Facebook spent on training LLaMA 2, but they
| say that it took them 184320 A100-80GB GPU-hours to train the 7B
| model [0]. AWS charges $14.46/hour for an instance that has 8 of
| those [1], which amounts to $1.81/GPU/hr.
|
| At that rate and assuming they paid something resembling AWS's
| list price, LLaMA 2 7B cost ~$333k. That's more than $100k, but
| not by orders of magnitude, and it's likely that Facebook wasn't
| paying the full price AWS is charging today.
|
| [0] https://github.com/meta-
| llama/llama/blob/main/MODEL_CARD.md#...
|
| [1] https://aws.amazon.com/ec2/instance-types/p4/
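|
| A back-of-envelope check of that estimate, as a quick Python
| sketch (the per-GPU rate is simply the 8-GPU p4d list price
| divided by eight, an assumption rather than Meta's actual bill):
|
|     gpu_hours = 184_320       # A100-80GB hours, per the model card
|     rate = 14.46 / 8          # assumed $/GPU-hour at AWS list price
|     print(f"~${gpu_hours * rate:,.0f}")   # ~$333,158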
| nuz wrote:
| Meta has their own data centers so they definitely didn't pay
| the equivalent to what AWS costs
| freedomben wrote:
| Good point, although it's possible that with the extreme
| price of GPUs it cost _more_ to train by buying hardware
| than it would to rent. For example, it might take two to
| three years before the GPUs are paid for by customers.
| fleischhauf wrote:
| I think AWS prices scale with hardware price
| greenavocado wrote:
| Linux reserved cost of a p3.16xlarge is $146,362.08
| annually. On-demand cost is $214,444.80 annually.
|
| I am pretty damn sure I could build an 8-GPU Intel Xeon
| E5-2686 v4 (Broadwell) (that's what Amazon uses - it's $30
| to $75 on eBay) server for less than that and come out
| ahead on electricity even at full throttle. RTX 4090s are
| just under $2,000 each on eBay.
|
| 8 GPU x $2000 (RTX 4090) + $1000 (for the rest of the
| computer) = $17,000
|
| If pulling 2 kW continuously at 15 cents per kWh for 1
| year, that's 2 kW x 24 hr x 365 days x $0.15/kWh, or
| $2,628.
|
| In total the computer will cost $19,628 if you throw it in
| the dumpster at the end of each calendar year of using it.
|
| If you stack internet cost of $200 a month on top, that's
| $2400 a year, which raises your annual cost to: $22,028
|
| This is still $124,334 cheaper per year than one AWS 8-GPU
| server if you fully depreciate your own hardware at the end
| of year 1 to $0.
|
| I could hire an engineer in America to babysit it with the
| money left over.
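|
| The same comparison as a small script (figures taken from
| the comment above; the flat 2 kW draw and the one-year
| write-off are simplifying assumptions):
|
|     gpus = 8 * 2_000            # RTX 4090s at ~$2,000 each
|     rest = 1_000                # rest of the machine
|     power = 2 * 24 * 365 * 0.15 # 2 kW all year at $0.15/kWh
|     internet = 200 * 12
|     diy = gpus + rest + power + internet
|     aws = 146_362.08            # p3.16xlarge, 1-yr reserved
|     print(f"{diy:,.0f}")        # 22,028
|     print(f"{aws - diy:,.0f}")  # 124,334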
| ghshephard wrote:
| A100-80GB GPUs go for about $20K each.
| greenavocado wrote:
| The instances in question use Tesla V100-SXM2-16GB
| OtherShrezzing wrote:
| Are consumer grade RTX 4090 cards going to be suitable
| for running full tilt 24/7 for a year? Those things are
| fine to stress on the latest game for a few hours at a
| time, but would probably develop some defects from
| significant heat stress after just a few days at 100%.
|
| This is inconsequential when you're playing Overwatch for
| a few hours a night and a frame drops now and again. If
| you're training an iteratively developed LLM though,
| physical defects could propagate into huge deficiencies
| in the final model.
| jsight wrote:
| I don't think they'd become a fire hazard, but it is true
| that one would likely pick something else for this
| application.
|
| Having said that, switching to something like the Tesla
| V100-SXM2-16GB wouldn't cost that much more.
|
| TBH, I'm shocked at how many people treat Amazon as the
| first choice for this stuff. Much of it isn't even what
| most would consider a "production" workload. You are
| paying for a lot of enterprise-readiness that you don't
| need for training.
| robrenaud wrote:
| If you wanted to finetune a Mixtral 8x7B, what would you
| use?
| oceanplexian wrote:
| Yep absolutely, crypto miners have been doing it for
| years.
|
| I still think it would be impractical at scale because
| they are so much hotter and more power-hungry than the
| datacenter cards, and you would be lucky to score one or
| two if you're on a wait list.
| renewiltord wrote:
| You can build an old Xeon-based machine, but it only has 40
| PCIe lanes. For training across 8 GPUs, how do you push
| data fast enough? I'm using a 7000-series Epyc for this to
| get 128 lanes. Have you built this kind of machine? Do you
| see good speed with 40 lanes? Curious, because then I could
| use an old Tyan motherboard, which comes in a full case
| with a good layout for multi-GPU. With the Epyc build I
| have to use a riser and a custom frame, which is painful.
|
| The new Tyan is more costly but has a great case layout.
| vidarh wrote:
| Though this comparison is really only relevant for a
| couple of machines. Beyond that, at this cost, if you pay
| AWS list prices "at scale" you're doing something very
| wrong.
|
| Don't get me wrong - I've frequently argued that AWS is
| price gouging and relying on people's lack of
| understanding of how the devops costs of running your own
| work out, but it doesn't take a huge budget before this
| calculation will look very different (still cheaper to
| own your own, though).
| packetslave wrote:
| Meta also doesn't pay AWS anywhere near retail price for
| instances.
| KMnO4 wrote:
| Why is this the case? Even AWS internal pays the same AWS
| prices as everyone else
| elcomet wrote:
| They have their own data centers, they don't use AWS
| vidarh wrote:
| I'm less surprised if AWS internally pays AWS list
| prices, because that's just internal accounting. Of the
| even relatively small AWS customers I know, none needed to
| get very far into six digits of annual spend before a
| couple of quiet mentions to their account manager that
| they were reviewing other options got them steep discounts.
|
| Add in lots of credits, and if you pay list price, you're
| being taken to the cleaners.
|
| I've done contract work for clients to be ready to
| migrate both as part of maximising credits and as part of
| negotiating posture, and the savings can be enormous
| (though it'd still usually be cheaper to use managed
| servers).
| DalasNoin wrote:
| This entire difference can be explained by their double
| mixture-of-experts architecture: only about 1/4 of the MLP
| and attention blocks are active at any time. Maybe this
| should be the headline: MoE reduces compute by a factor of 4
| without losing accuracy. But this is already known. Still
| interesting to see a smaller MoE model. This could be the
| ideal size for many local applications.
| Centigonal wrote:
| MoE reduces compute cost for inference at scale, but not for
| training. You still have to train the whole model (plus the
| router)
| smusamashah wrote:
| If MoEs are that good (we know GPT-4 is one), then why not
| train very specific MoEs? One part of an MoE could be a
| perfect math model which can actually calculate 2+2.
| Wouldn't models like these be better in general?
| refulgentis wrote:
| Keeping it short: "Not even wrong", in the Pauli sense.
|
| - People hear "mixture of experts" and they think "N
| specialists" - but e.g. think how much you need to know
| to autocomplete "Two plus two is "
|
| - The fundamental thing in ML is you define functions and
| give it data, and the more data you give it, the better. Once
| you're at "I will simply give it the training data needed to
| be good enough at the task and wall off that part of the
| implementation" you're outside ML and have a chicken and
| egg problem
|
| - We don't know GPT-4 is MoE
|
| - MoE in practice is fundamentally about trading off
| runtime vs. static size properties to gain inference speed.
| I.e. 7x8 stored and picking 7x2 at runtime means you're
| somewhere between 7x2 and 7x3 in quality, inference at 7x2
| speed, and have to train and store and load 7x8. You don't
| reach for it to increase quality, you reach for it to
| increase inference speed at the expense of inference ram
| and total model size.
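|
| That trade-off shows up even in a toy top-2 router; here is a
| sketch in Python, where the shapes, weights, and expert count
| are made up for illustration rather than taken from any real
| model:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     d, n_experts, top_k = 16, 8, 2     # toy sizes, illustration only
|     experts = rng.standard_normal((n_experts, d, d)) * 0.02
|     router = rng.standard_normal((d, n_experts)) * 0.02
|
|     def moe(x):               # x: one token, shape (d,)
|         scores = x @ router   # all experts are stored...
|         top = np.argsort(scores)[-top_k:]  # ...only top_k run
|         w = np.exp(scores[top]); w /= w.sum()
|         return sum(wi * (x @ experts[i])
|                    for wi, i in zip(w, top))
|
|     print(moe(rng.standard_normal(d)).shape)   # (16,)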
| dannyw wrote:
| Any company as big as Meta has teams working on optimisation
| (e.g. optimised kernels), usually with direct engagement with
| NVIDIA engineers.
|
| These kinds of things are usually only selectively shared.
| benreesman wrote:
| I'll agree with your general point even though there are some
| subtleties to FBNY.
|
| More important, we let an awful lot of self-promotion from the
| big guys slide around here.
|
| I can live with the guys and gals doing this on a shoestring
| getting a little of that sweet hype love. This seems pretty
| legit.
| tosh wrote:
| Anyone got a ballpark figure for what Meta spent on Llama 2
| training for the 7B model?
| antimatter15 wrote:
| It looks like Llama 2 7B took 184,320 A100-80GB GPU-hours to
| train [1]. This one says it used a 96xH100 GPU cluster for 2
| weeks, i.e. 32,256 GPU-hours. That's 17.5% of the hours, but
| H100s are faster than A100s [2] and FP16/bfloat16 performance is
| ~3x better.
|
| If they had tried to replicate Llama 2 identically with their
| hardware setup, it'd cost a little less than twice what their
| MoE model did.
|
| [1] https://github.com/meta-
| llama/llama/blob/main/MODEL_CARD.md#...
|
| [2] https://blog.ori.co/choosing-between-
| nvidia-h100-vs-a100-per...
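|
| Roughly, the same arithmetic in code (the ~3x factor is the
| bf16 spec-sheet ratio assumed above, so treat the result as a
| rough upper bound):
|
|     a100_hours = 184_320          # Llama 2 7B, per the model card
|     h100_hours = 96 * 24 * 14     # JetMoE's stated cluster/duration
|     print(h100_hours)                     # 32256
|     print(h100_hours / a100_hours)        # 0.175
|     # replicating Llama 2 7B on H100s at ~3x A100 throughput:
|     print((a100_hours / 3) / h100_hours)  # ~1.9x the JetMoE run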
| anizan wrote:
| They mention the cost was ~$80,000 USD, so for 32,256 GPU-hours
| it comes to ~$2.48 an hour. Amazing how cost effective the
| compute actually is.
| davidcollantes wrote:
| Can't wait for the GGUF to play with it. I tried the demo
| (https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat), and
| the results were very good!
| kleiba wrote:
| I've been out of academia for a bit, but in my day 100k USD would
| _not_ have been considered academia-friendly in my neck of the
| woods...
| dheera wrote:
| That's about the cost of 1 grad student year including all
| overhead, I believe. It's definitely far less than what many
| physics and biology labs spend on equipment in a year.
|
| I mean, you're an idiot of a PI if you have $500K/year of
| grants and spend it on 5 students and no compute.
| barkingcat wrote:
| This kind of assumption is super deceptive.
|
| The Facebook budget includes money to pay off people they've
| ripped off (in private settlements) and money for lawyers to
| shield the developers so they can feel free to rip off
| copyrighted content without having to pay personal penalty or be
| imprisoned for infringement. It also includes the price of buying
| lobbyists to alter laws to let this practice continue.
|
| Also, unless the authors work inside Facebook, they have no idea
| how much Facebook spent on training that model specifically.
| ein0p wrote:
| At $DAY_JOB nowadays we run 128x H100 runs without thinking
| twice. It only takes a few days to train a small-ish LLM with
| that to test out some ideas.
| uptownfunk wrote:
| Where are they hosted?
| ein0p wrote:
| AWS and GCP both.
| turnsout wrote:
| Out of curiosity, what leads you to train models from the
| ground up rather than fine tuning existing models?
| ein0p wrote:
| We do both. You can't just fine tune if you're trying a
| different model architecture, or even changing some of the
| hyperparameters on an existing one. Every now and again
| you might be able to reuse some of the weights, but
| that's about it. That's part of the reason research is so
| incredibly expensive and time consuming in this field. I
| bet that $80k is only a fraction of the overall cost for
| the model described in the article, too.
| moffkalast wrote:
| echo $DAY_JOB
| YetAnotherNick wrote:
| > It only uses public datasets for training, and the code is
| open-sourced
|
| Looking at the repo, there is no training or data processing
| code.
| jsight wrote:
| It was trained with "1 trillion tokens from large-scale open-
| source pretraining datasets, including RefinedWeb, Pile, Github
| data, etc."
|
| I guess it is good that they mentioned some of it, but yeah,
| that isn't exceptionally helpful when making claims of it being
| 100% open source.
|
| I'm not sure why they feel the need to be so secretive if all
| of the sources are open.
| vertis wrote:
| It might be equivalent to LLaMA2 but it's still not capable of
| even simple reasoning:
|
| > If two apples cost 2 dollars and 1 apple costs 1.20, what is
| the discount you're getting
|
| > To calculate the discount, you need to compare the original
| price of the apples to the price after the discount.
|
| > The original price for two apples is: 2 apples * $1.20/apple =
| $2.40
|
| > The price for one apple after the discount is: 1 apple *
| $2.00/apple = $2.00
|
| > Now, let's calculate the discount for one apple:
| > Original price for one apple - Discounted price for one
| apple = Discount amount
| > $2.40 - $2.00 = $0.40
|
| > The discount for one apple is $0.40. Since you're buying two
| apples, you'll get the discount on both:
| > Discount amount for two apples = 2 * $0.40 = $0.80
|
| > So, the discount you're getting for two apples is $0.80.
| jsight wrote:
| I wonder if the reasoning capability of an MoE model is limited
| by the max size of the individual experts?
| moffkalast wrote:
| The simple fact that they did not list the current SOTA for the
| size class in their comparison table tells you all you need to
| know about their confidence. And listing Gemma-2B is like
| shooting fish in a barrel, might as well also put RedPajama on
| there.
|
| It's good to see MoE being attempted at the smaller sizes, and
| it may scale well downwards as well given their results. But
| regardless, 1.25T is very little training data compared to the
| 6T that Mistral 7B received, and even that leaves it barely
| usable and likely not yet saturated. Before it, the sub-13B
| size class was considered basically an academic exercise.
| operator-name wrote:
| Sadly, no matter what kind of prompting I try, it fails the
| sisters question:
|
| > Sally (a girl) has three brothers. Each of her brothers has two
| sisters. How many sisters does Sally have?
|
| Sally (a girl) has three brothers. Each of her brothers has two
| sisters. How many sisters does Sally have?
|
| Sally has two sisters. Here's the reasoning: Sally has three
| brothers. Each of her brothers has two sisters. Since her
| brothers are the ones who have two sisters each, and we know
| there are three brothers, we can conclude that each brother has
| two sisters. Therefore, there are two sisters in total (one for
| each brother). So, Sally has two sisters.
|
| https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat
| moffkalast wrote:
| Not exactly something you'd expect a model of this size to
| solve; Mixtral fails it too (if you switch the names/genders
| that is, since it's contaminated with a few versions). It does
| at least indicate that their training data might indeed be as
| clean as they say.
| patrick-fitz wrote:
| Out of curiosity, I looked for the cheapest H100 price I could
| find online.
|
| Lambda Reserved Cloud [1] starts at $1.89 per H100 per hour.
|
| It could be possible to get the cost down even lower:
|
| $1.89 * 96 GPUs * 24 hours * 14 days = ~$61k
|
| 1 - https://lambdalabs.com/deep-learning/servers/hyperplane
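|
| The quick math, for reference (assuming that quoted rate holds
| for the whole run):
|
|     print(f"${1.89 * 96 * 24 * 14:,.0f}")   # $60,964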
| avrionov wrote:
| This is the price of training if nothing fails.
___________________________________________________________________
(page generated 2024-04-04 23:00 UTC)