[HN Gopher] JetMoE: Reaching LLaMA2 performance with 0.1M dollars
       ___________________________________________________________________
        
       JetMoE: Reaching LLaMA2 performance with 0.1M dollars
        
       Author : gyre007
       Score  : 136 points
       Date   : 2024-04-04 17:04 UTC (5 hours ago)
        
 (HTM) web link (research.myshell.ai)
 (TXT) w3m dump (research.myshell.ai)
        
       | helloericsf wrote:
       | X thread: https://x.com/qinzytech/status/1775916338822709755?s=20
        
       | plufz wrote:
       | You've been in tech for too long when 1 million USD is your
       | smallest unit.
        
         | noodlesUK wrote:
         | I wonder why they decided to call it 0.1M USD rather than 100k
         | USD. For many of us, a million dollars is a large amount of
         | money, even for a business.
        
           | mattlondon wrote:
            | Same reason things are priced at x.99 - the "0.1" decimal
            | "feels" smaller than seeing "100,000" ("holy fuck!"), etc.
        
           | plufz wrote:
           | I'm sure they had their reasons, but all I can see is the
           | Simpsons meme with Mr Burns at the ATM saying "What's the
           | smallest amount of money I can think of? A thousand dollars."
           | ;)
        
           | IshKebab wrote:
           | It's to imply that it costs other people in the millions, but
           | they did it for only 0.1 million, which is a small number of
           | millions. Just a rhetorical trick.
        
           | oceanplexian wrote:
            | 100k isn't worth anywhere near what it used to be, due to
            | inflation. It might get you a nice pickup truck or a
            | kitchen remodel. If your business is doing research and
            | can't spend that, it's more of a hobby than a business.
        
         | uptownfunk wrote:
         | Well, it's interesting to think about how much has been
         | invested into BigModel companies (Anthropic, Perplexity,
         | OpenAI) when it's very rapidly becoming commoditized.
        
       | ipsum2 wrote:
        | I'm skeptical; I expect data contamination was the reason for
        | the high benchmark scores.
        
         | hiddencost wrote:
         | Yeah. IBM especially has a history of fudging the numbers on
         | reports like this. Research puts together reports which are
         | aggressively p-hacked and ensembled and overfit, and then sales
         | uses those reports to boondoggle clients into using IBM.
        
       | throwitaway222 wrote:
       | This stuff is just going to keep getting pushed down.
        
       | lolinder wrote:
        | > JetMoE-8B is trained with less than $0.1 million cost but
       | outperforms LLaMA2-7B from Meta AI, who has multi-billion-dollar
       | training resources. LLM training can be much cheaper than people
       | generally thought.
       | 
       | They want you to read this as "we spent $100k compared to Meta's
       | spending billions", but that's not actually what this says. It
       | says that they spent $100k and Meta _has the resources_ to spend
       | billions if they wanted to.
       | 
       | We don't know what Facebook spent on training LLaMA 2, but they
        | say that it took them 184,320 A100-80GB GPU-hours to train the 7B
       | model [0]. AWS charges $14.46/hour for an instance that has 8 of
       | those [1], which amounts to $1.81/GPU/hr.
       | 
       | At that rate and assuming they paid something resembling AWS's
       | list price, LLaMA 2 7B cost ~$333k. That's more than $100k, but
       | not by orders of magnitude, and it's likely that Facebook wasn't
       | paying the full price AWS is charging today.
       | 
       | [0] https://github.com/meta-
       | llama/llama/blob/main/MODEL_CARD.md#...
       | 
       | [1] https://aws.amazon.com/ec2/instance-types/p4/
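        | 
        | As a rough sketch of that arithmetic (the GPU-hour figure is
        | from Meta's model card, the $14.46/hr p4d rate is AWS's
        | published on-demand price; any discount Meta actually got is
        | unknown):
        | 
        |   gpu_hours = 184_320      # A100-80GB GPU-hours (model card)
        |   p4d_hourly = 14.46       # 8xA100 p4d on-demand, USD/hour
        |   per_gpu_hour = p4d_hourly / 8            # ~$1.81/GPU-hour
        |   print(f"~${gpu_hours * per_gpu_hour:,.0f}")   # ~$333,158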
        
         | nuz wrote:
          | Meta has their own data centers, so they definitely didn't
          | pay the equivalent of what AWS costs.
        
           | freedomben wrote:
            | Good point, although it's possible that, with the extreme
            | price of GPUs, it cost _more_ to train by buying hardware
            | than it would to rent. For example, it might take two to
            | three years before the GPUs are paid for by customers.
        
             | fleischhauf wrote:
             | I think AWS prices scale with hardware price
        
             | greenavocado wrote:
              | The Linux reserved cost of a p3.16xlarge is $146,362.08
              | annually. The on-demand cost is $214,444.80 annually.
             | 
              | I am pretty damn sure I could build an 8-GPU Intel Xeon
              | E5-2686 v4 (Broadwell) server (that's the CPU Amazon uses
              | - it's $30 to $75 on eBay) for less than that and come
              | out ahead on electricity even at full throttle. RTX 4090s
              | are just under $2,000 each on eBay.
             | 
              | 8 GPUs x $2,000 (RTX 4090) + $1,000 (for the rest of the
              | computer) = $17,000
             | 
              | If pulling 2 kW continuously at 15 cents per kWh for 1
              | year, that's 2 kW x 24 hours x 365 days x $0.15/kWh, or
              | $2,628.
             | 
             | In total the computer will cost $19,628 if you throw it in
             | the dumpster at the end of each calendar year of using it.
             | 
             | If you stack internet cost of $200 a month on top, that's
             | $2400 a year, which raises your annual cost to: $22,028
             | 
             | This is still $124,334 cheaper per year than one AWS 8-GPU
             | server if you fully depreciate your own hardware at the end
             | of year 1 to $0.
             | 
             | I could hire an engineer in America to babysit it with the
             | money left over.
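              | 
              | The same comparison as a small script (all figures are
              | the rough estimates above; real prices, power draw and
              | GPU choice will obviously vary):
              | 
              |   gpus = 8 * 2_000             # eight RTX 4090s
              |   rest = 1_000                 # rest of the machine
              |   power = 2 * 24 * 365 * 0.15  # 2 kW at $0.15/kWh, 1 yr
              |   internet = 200 * 12
              |   diy = gpus + rest + power + internet
              |   aws_reserved = 146_362       # p3.16xlarge, 1-yr term
              |   print(diy, aws_reserved - diy)
              |   # ~22,028 vs ~124,334 saved in year one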
        
               | ghshephard wrote:
                | A100-80GB GPUs go for about $20K each.
        
               | greenavocado wrote:
               | The instances in question use Tesla V100-SXM2-16GB
        
               | OtherShrezzing wrote:
                | Are consumer-grade RTX 4090 cards going to be suitable
                | for running full tilt 24/7 for a year? Those things are
                | fine to stress on the latest game for a few hours at a
                | time, but would probably develop defects from
                | significant heat stress after just a few days at 100%.
               | 
               | This is inconsequential when you're playing Overwatch for
               | a few hours a night and a frame drops now and again. If
               | you're training an iteratively developed LLM though,
               | physical defects could propagate into huge deficiencies
               | in the final model.
        
               | jsight wrote:
               | I don't think they'd become a fire hazard, but it is true
               | that one would likely pick something else for this
               | application.
               | 
               | Having said that, switching to something like the Tesla
               | V100-SXM2-16GB wouldn't cost that much more.
               | 
               | TBH, I'm shocked at how many people treat Amazon as the
               | first choice for this stuff. Much of it isn't even what
               | most would consider a "production" workload. You are
               | paying for a lot of enterprise-readiness that you don't
               | need for training.
        
               | robrenaud wrote:
               | If you wanted to finetune a Mixtral 8x7B, what would you
               | use?
        
               | oceanplexian wrote:
                | Yep, absolutely - crypto miners have been doing it for
                | years.
                | 
                | I still think it would be impractical at scale because
                | they are so much hotter and more power hungry than the
                | datacenter cards, and you would be lucky to score one
                | or two if you're on a wait list.
        
               | renewiltord wrote:
                | You can build an old Xeon-based box, but it only has
                | 40 PCIe lanes. For training on 8 GPUs, how do you push
                | data fast enough? I'm using a 7000-series Epyc for this
                | to get 128 lanes. Have you built this kind of machine?
                | Do you see good speed with 40 lanes? Curious, because
                | then I could use an old Tyan motherboard, which comes
                | in a full case with a good layout for multiple GPUs.
                | With the Epyc build I have to use risers and a custom
                | frame, which is painful.
                | 
                | The new Tyan is more costly but has a great case
                | layout.
        
               | vidarh wrote:
                | Though this comparison is really only relevant for a
                | couple of machines. Beyond that, at this cost, if you
                | pay AWS list prices "at scale" you're doing something
                | very wrong.
                | 
                | Don't get me wrong - I've frequently argued that AWS is
                | price gouging and relying on people's lack of
                | understanding of how the devops costs of running your
                | own work out, but it doesn't take a huge budget before
                | this calculation will look very different (still
                | cheaper to own your own, though).
        
           | packetslave wrote:
           | Meta also doesn't pay AWS anywhere near retail price for
           | instances.
        
             | KMnO4 wrote:
              | Why is this the case? Even AWS internally pays the same
              | AWS prices as everyone else.
        
               | elcomet wrote:
               | They have their own data centers, they don't use AWS
        
               | vidarh wrote:
                | I'm less surprised if AWS internally pays AWS list
                | prices, because that's just internal accounting. Of
                | even the relatively small AWS customers I know, none
                | needed to get very far into six digits of annual spend
                | before a couple of quiet mentions to their account
                | manager that they were reviewing other options were
                | enough to get steep discounts.
                | 
                | Add in lots of credits, and if you pay list price,
                | you're being taken to the cleaners.
               | 
               | I've done contract work for clients to be ready to
               | migrate both as part of maximising credits and as part of
               | negotiating posture, and the savings can be enormous
               | (though it'd still usually be cheaper to use managed
               | servers).
        
         | DalasNoin wrote:
          | This entire difference can be explained by their double
          | mixture-of-experts architecture: only 1/4 of the MLP and
          | attention blocks are used at any time. Maybe this should be
          | the headline: MoE reduces compute by a factor of 4 without
          | losing accuracy. But this is already known. Still, it's
          | interesting to see a smaller MoE model. This could be the
          | ideal size for many local applications.
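          | 
          | A hedged illustration of where the factor of 4 comes from
          | (the expert counts below are only there to show the ratio,
          | not JetMoE's exact configuration):
          | 
          |   total_experts, active_experts = 8, 2     # illustrative
          |   # per-token FLOPs in the sparse MLP/attention blocks scale
          |   # roughly with the active fraction:
          |   print(active_experts / total_experts)    # 0.25 -> ~4x less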
        
           | Centigonal wrote:
           | MoE reduces compute cost for inference at scale, but not for
           | training. You still have to train the whole model (plus the
           | router)
        
           | smusamashah wrote:
            | If MoEs are that good (and we know GPT-4 is one), then why
            | not train very specific MoEs? One part of the MoE could be
            | a perfect math model which can actually calculate 2+2.
            | Wouldn't models like these be better in general?
        
             | refulgentis wrote:
             | Keeping it short: "Not even wrong", in the Pauli sense.
             | 
             | - People hear "mixture of experts" and they think "N
             | specialists" - but ex. think how much know you need to know
             | to autocomplete "Two plus two is "
             | 
             | - Fundamental thing of ML is you define functions and give
             | it data, and the more data you give it to the better. Once
             | youre at "I will simply give it the training data needed to
             | be good enough at the task and wall off that part of the
             | implementation" you're outside ML and have a chicken and
             | egg problem
             | 
             | - We don't know GPT-4 is MoE
             | 
              | - MoE in practice is fundamentally about trading off
              | runtime vs. static size properties to gain inference
              | speed. I.e. 7x8 stored and picking 7x2 at runtime means
              | you're somewhere between 7x2 and 7x3 in quality, get
              | inference at 7x2 speed, and have to train, store, and
              | load 7x8. You don't reach for it to increase quality; you
              | reach for it to increase inference speed at the expense
              | of inference RAM and total model size.
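              | 
              | A minimal sketch of the top-k routing being described,
              | with toy shapes (illustrative only, not JetMoE's or
              | Mixtral's actual implementation):
              | 
              |   import numpy as np
              | 
              |   def moe_layer(x, experts, gate, k=2):
              |       # x: (d,), experts: (n, d, d), gate: (d, n)
              |       logits = x @ gate
              |       top = np.argsort(logits)[-k:]   # k best experts
              |       w = np.exp(logits[top])
              |       w /= w.sum()                    # softmax over top-k
              |       # only the k selected experts run; the rest are
              |       # skipped, which is where the speedup comes from
              |       return sum(wi * (x @ experts[i])
              |                  for wi, i in zip(w, top))
              | 
              |   rng = np.random.default_rng(0)
              |   d, n = 16, 8          # toy size: 8 experts, pick 2
              |   y = moe_layer(rng.normal(size=d),
              |                 rng.normal(size=(n, d, d)),
              |                 rng.normal(size=(d, n)))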
        
         | dannyw wrote:
          | Any company as big as Meta has teams working on optimisation
          | (e.g. optimised kernels), usually with direct engagement with
          | NVIDIA engineers.
          | 
          | These kinds of things are usually only selectively shared.
        
         | benreesman wrote:
         | I'll agree with your general point even though there are some
         | subtleties to FBNY.
         | 
          | More importantly, we let an awful lot of self-promotion from
          | the big guys slide around here.
         | 
         | I can live with the guys and gals doing this on a shoestring
         | getting a little of that sweet hype love. This seems pretty
         | legit.
        
       | tosh wrote:
       | Anyone got a ballpark figure for what Meta spent on Llama 2
       | training for the 7B model?
        
       | antimatter15 wrote:
        | It looks like Llama 2 7B took 184,320 A100-80GB GPU-hours to
        | train[1]. This one says it used a 96xH100 GPU cluster for 2
        | weeks, which is 32,256 GPU-hours. That's 17.5% of the GPU-
        | hours, but H100s are faster than A100s [2] and their
        | FP16/bfloat16 performance is ~3x better.
       | 
        | If they had tried to replicate Llama 2 identically with their
        | hardware setup, it'd cost a little less than twice what their
        | MoE model did.
       | 
       | [1] https://github.com/meta-
       | llama/llama/blob/main/MODEL_CARD.md#...
       | 
       | [2] https://blog.ori.co/choosing-between-
       | nvidia-h100-vs-a100-per...
        
         | anizan wrote:
          | They mention the cost was ~$80,000, so for 32,256 GPU-hours
          | it comes to ~$2.48 an hour. Amazing how cost-effective the
          | compute actually is.
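          | 
          | A quick sanity check of both figures (using the ~$80k figure
          | quoted above and the 184,320 A100-hours from Meta's model
          | card):
          | 
          |   h100_hours = 96 * 24 * 14     # 32,256 GPU-hours
          |   print(h100_hours / 184_320)   # ~0.175 of Llama 2 7B
          |   print(80_000 / h100_hours)    # ~$2.48 per H100-hour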
        
       | davidcollantes wrote:
       | Can't wait for the GGUF to play with it. I tried the demo
       | (https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat), and
       | the results were very good!
        
       | kleiba wrote:
       | I've been out of academia for a bit, but in my day 100k USD would
       | _not_ have been considered academia-friendly in my neck of the
       | woods...
        
         | dheera wrote:
         | That's about the cost of 1 grad student year including all
         | overhead, I believe. It's definitely far less than what many
         | physics and biology labs spend on equipment in a year.
         | 
         | I mean, you're an idiot of a PI if you have $500K/year of
         | grants and spend it on 5 students and no compute.
        
       | barkingcat wrote:
       | This kind of assumption is super deceptive.
       | 
        | The Facebook budget includes money to pay off people they've
        | ripped off (in private settlements) and money for lawyers to
        | shield the developers so they can feel free to rip off
        | copyrighted content without having to pay a personal penalty
        | or be imprisoned for infringement. It also includes the price
        | of buying lobbyists to alter laws to let this practice
        | continue.
       | 
       | Also, unless the authors work inside Facebook, they have no idea
       | how much Facebook spent on training that model specifically.
        
       | ein0p wrote:
        | At $DAY_JOB nowadays we run 128x H100 runs without thinking
        | twice. It only takes a few days to train a small-ish LLM with
        | that to test out some ideas.
        
         | uptownfunk wrote:
         | Where are they hosted?
        
           | ein0p wrote:
           | AWS and GCP both.
        
             | turnsout wrote:
             | Out of curiosity, what leads you to train models from the
             | ground up rather than fine tuning existing models?
        
               | ein0p wrote:
                | We do both. You can't just fine-tune if you're trying
                | a different model architecture, or even changing some
                | of the hyperparameters of an existing one. Every now
                | and again you might be able to reuse some of the
                | weights, but that's about it. That's part of the reason
                | research is so incredibly expensive and time-consuming
                | in this field. I bet that $80k is only a fraction of
                | the overall cost for the model described in the
                | article, too.
        
         | moffkalast wrote:
         | echo $DAY_JOB
        
       | YetAnotherNick wrote:
       | > It only uses public datasets for training, and the code is
       | open-sourced
       | 
       | Looking at the repo, there is no training or data processing
       | code.
        
         | jsight wrote:
         | It was trained with "1 trillion tokens from large-scale open-
         | source pretraining datasets, including RefinedWeb, Pile, Github
         | data, etc."
         | 
         | I guess it is good that they mentioned some of it, but yeah,
         | that isn't exceptionally helpful when making claims of it being
         | 100% open source.
         | 
         | I'm not sure why they feel the need to be so secretive if all
         | of the sources are open.
        
       | vertis wrote:
       | It might be equivalent to LLaMA2 but it's still not capable of
       | even simple reasoning:
       | 
       | > If two apples cost 2 dollars and 1 apple costs 1.20, what is
       | the discount you're getting
       | 
       | > To calculate the discount, you need to compare the original
       | price of the apples to the price after the discount.
       | 
       | > The original price for two apples is: 2 apples * $1.20/apple =
       | $2.40
       | 
       | > The price for one apple after the discount is: 1 apple *
       | $2.00/apple = $2.00
       | 
        | > Now, let's calculate the discount for one apple:
        | 
        | > Original price for one apple - Discounted price for one
        | apple = Discount amount
        | 
        | > $2.40 - $2.00 = $0.40
        | 
        | > The discount for one apple is $0.40. Since you're buying two
        | apples, you'll get the discount on both:
        | 
        | > Discount amount for two apples = 2 * $0.40 = $0.80
       | 
       | > So, the discount you're getting for two apples is $0.80.
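        | 
        | For reference, the arithmetic the prompt is actually asking
        | for (assuming the intended reading is two apples for $2 versus
        | $1.20 each when bought individually):
        | 
        |   regular = 2 * 120       # two apples at 120 cents each
        |   deal = 200              # the two-for-$2 offer, in cents
        |   print(regular - deal)   # 40 -> the total discount is $0.40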
        
         | jsight wrote:
          | I wonder if the reasoning capability of an MoE model is
          | limited by the max size of the individual experts?
        
         | moffkalast wrote:
          | The simple fact that they did not list the current SOTA for
          | the size class in their comparison table tells you all you
          | need to know about their confidence. And listing Gemma-2B is
          | like shooting fish in a barrel; they might as well also put
          | RedPajama on there.
          | 
          | It's good to see MoE being attempted at the smaller sizes,
          | and it may scale well downwards too, given their results. But
          | regardless, 1.25T is very little training data compared to
          | the 6T that Mistral 7B received, and even that leaves it
          | barely usable and likely not yet saturated. Before it, the
          | sub-13B size class was considered basically an academic
          | exercise.
        
       | operator-name wrote:
        | Sadly, no matter what kind of prompting I try, it fails the
        | sisters question:
       | 
       | > Sally (a girl) has three brothers. Each of her brothers has two
       | sisters. How many sisters does Sally have?
       | 
       | Sally (a girl) has three brothers. Each of her brothers has two
       | sisters. How many sisters does Sally have?
       | 
       | Sally has two sisters. Here's the reasoning: Sally has three
       | brothers. Each of her brothers has two sisters. Since her
       | brothers are the ones who have two sisters each, and we know
       | there are three brothers, we can conclude that each brother has
       | two sisters. Therefore, there are two sisters in total (one for
       | each brother). So, Sally has two sisters.
       | 
       | https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat
        
         | moffkalast wrote:
          | Not exactly something you'd expect a model of this size to
          | solve; Mixtral fails it too (if you switch the names/genders,
          | that is, since it's contaminated with a few versions). It
          | does at least indicate that their training data might indeed
          | be as clean as they say.
        
       | patrick-fitz wrote:
        | Out of curiosity, I looked at the cheapest price for an H100
        | that I could find online.
        | 
        | Lambda Reserved Cloud [1] starts at $1.89 per H100 per hour,
        | so it could be possible to get the cost even lower:
        | 
        | $1.89 * 96 GPUs * 24 hours * 14 days = ~$61k
       | 
       | 1 - https://lambdalabs.com/deep-learning/servers/hyperplane
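        | 
        | The same estimate as a one-liner (assuming the $1.89/hr
        | reserved rate holds for the full two weeks and the cluster is
        | fully utilised):
        | 
        |   print(1.89 * 96 * 24 * 14)   # ~$60,964, i.e. roughly $61k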
        
         | avrionov wrote:
         | This is the price of training if nothing fails.
        
       ___________________________________________________________________
       (page generated 2024-04-04 23:00 UTC)