[HN Gopher] Training LLMs from ground zero as a startup
___________________________________________________________________
Training LLMs from ground zero as a startup
Author : swyx
Score : 395 points
Date : 2024-03-05 22:31 UTC (2 days ago)
(HTM) web link (www.yitay.net)
(TXT) w3m dump (www.yitay.net)
| swyx wrote:
| for context Yi Tay was Tech Lead on Google PaLM, UL2, Flan, Bard,
| etc and now is cofounder at Reka (which has shipped some v
| interesting small multimodal models that have featured on here).
| I prompted him for this post as an ex-Googler now training LLMs
| as an independent startup
| https://twitter.com/YiTayML/status/1765105066263052718
|
| our conversation was recorded here
| https://sub.thursdai.news/p/thursdai-feb-15-2024-openai-chan...
| swyx wrote:
| (update: i submitted this yesterday and it didnt get traction,
| i guess @dang must've merged the old submission in here. you
| really didnt have to, but its a nice gesture. thanks dang!!)
| axpy906 wrote:
| Great to see you on here. Love the Latent Space podcast.
| swyx wrote:
| aw thank you for listening. some weeks its very much a
| labor of love lol.
|
| no events planned near term but come to the big shindig in
| june https://ti.to/software-3/ai-engineer-worlds-fair .
| last year's summit was the first time i really understood
| how much of a reach we have and how many good AI people
| we've managed to gather as friends.
| dwaltrip wrote:
| I love it as well, it's a fantastic resource :)
| 3abiton wrote:
| Is he the person behind the Yi LLM models?
| bigcat12345678 wrote:
| No, the Yi LLM models are from [0], Kai-Fu Lee's LLM startup.
|
| [0] https://www.lingyiwanwu.com/
| pama wrote:
| Training LLMs from scratch is a super important issue that affects
| the pace and breadth of iteration of AI almost as much as the raw
| hardware improvements do. The blog is fun but somewhat shallow
| and not technical or very surprising if you've worked with
| clusters of GPUs in any capacity over the years. (I liked the
| perspective of a former googler, but I'm not sure why past
| colleagues would recommend Jax over pytorch for LLMs outside of
| Google.) I hope this newco eventually releases a more technical
| report about their training adventures, like the PDF file here:
| https://github.com/facebookresearch/metaseq/tree/main/projec...
| axpy906 wrote:
| If you're doing research JAX makes some sense. Probably some
| Google bias in there too.
| lyapunova wrote:
| To be honest, most researchers in applied ML in the bay say
| the opposite. If you are trying to be nimble and prototype,
| use pytorch. If you're trying to gain some optimizations as
| you near deployment, rewrite in Jax.
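| As a rough illustration of the kind of rewrite being described
| (a hypothetical toy function, not anything from the post):
| prototyping code can stay in eager PyTorch, while the JAX
| version hands the whole computation to XLA via jit.
|
|   # Minimal sketch: jit-compiling a toy step function with JAX.
|   # XLA fuses the ops into one compiled program, which is where
|   # the near-deployment optimizations tend to come from.
|   import jax
|   import jax.numpy as jnp
|
|   def step(w, x):
|       # stand-in for a model forward pass + loss
|       return jnp.mean((x @ w - 1.0) ** 2)
|
|   grad_step = jax.jit(jax.grad(step))   # compiled gradient fn
|
|   w = jnp.zeros((512, 1))
|   x = jnp.ones((64, 512))
|   g = grad_step(w, x)   # first call compiles, later calls reuse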
| plumeria wrote:
| Where does Tensorflow stand in this?
| axpy906 wrote:
| Somewhere next to Theano, Mxnet or Caffe.
| plumeria wrote:
| So, obsolete?
| omneity wrote:
| What about Keras?
| rockinghigh wrote:
| Tensorflow has been falling behind since they stopped
| caring about backward compatibility. PyTorch is the
| leading framework. Jax is getting some traction at Google
| and was used to train Gemini.
| axpy906 wrote:
| Interesting. I've never heard that. I could see that
| argument going both ways, as PyTorch has the larger
| ecosystem and appears in the most publications.
| pama wrote:
| Interesting perspective about possible Jax optimizations.
| Assuming these models are trained and deployed on non-TPU
| hardware, are there any real advantages in using Jax for
| deployment on GPU? I'd have assumed that inference is
| largely a solved optimization for large transformer based
| models (with any low hanging fruits from custom CUDA code
| already written) and the details are shifting towards
| infrastructure tradeoffs and availability of efficient
| GPUs. But I may be out of the loop with the latest gossip.
| Or do you simply mean that maybe there exist cases where
| TPU inference makes sense financially and using jax makes a
| difference?
| abeppu wrote:
| It's worth taking a second to note that the author just assumes
| that readers understand "the wilderness" to mean "not Google".
|
| This post gives a lot of credit to Google's infra and hardware
| teams, and I'd love to read a perspective from one of those
| insiders who then went on to do related work elsewhere.
| joe_the_user wrote:
| I took the phrase to mean "outside any large company". It seems
| like a fairly obvious metaphor; if you have a startup working on
| a large-scale infrastructure project, you have to set up your own
| logistics, just like a camp in the literal wilderness.
| choppaface wrote:
| Really telling quote:
|
| > I was completely taken aback by the failure rate of GPUs as
| opposed to my experiences on TPUs at Google
|
| Should be "I was completely unaware of the failure modes of
| GPUs, because all my career I've been inside Google and used
| Google TPUs and was well-acquainted with those failure modes."
|
| I've used GPUs mostly, and when I tried TPUs the jobs failed
| _all the time_ for really hard-to-debug reasons. Often the
| indirection between the x86 chip and the TPU device caused
| hours of hair-pulling, stuff you never get with
| x86+nvidia+pytorch.
|
| 10-15 years ago, Google minted many $10m+ data scientists (aka
| Sawzall engineers) who also ventured "into the wilderness" and
| had very similar reactions. This blog post is much more about
| the OP hyping his company and personal brand than contributing
| useful notes to the community.
| StarCyan wrote:
| When was this? I use JAX+TPUs to train LLMs and haven't
| experienced many issues. IMO it was way easier to set up
| distributed training, sharding, etc compared to Pytorch+GPUs.
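| For anyone who hasn't used it, here's a minimal sketch of the
| kind of setup being described (mesh axis names and shapes are
| purely illustrative): you declare a device mesh, tell JAX how an
| array is sharded across it, and jit-compiled code then runs SPMD
| without hand-written collectives.
|
|   import numpy as np
|   import jax
|   import jax.numpy as jnp
|   from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
|
|   # 1D mesh over whatever accelerators are attached (TPUs or GPUs)
|   mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
|
|   # shard the batch dimension across the "data" axis of the mesh
|   x = jnp.ones((8 * jax.device_count(), 1024))
|   x = jax.device_put(x, NamedSharding(mesh, P("data", None)))
|
|   @jax.jit
|   def forward(x):
|       return jnp.tanh(x) @ x.T   # stand-in for a model forward pass
|
|   y = forward(x)                 # executes sharded across the mesh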
| quadrature wrote:
| I think the OP is referring to hardware failures rather than
| software not playing well together.
| ganeshkrishnan wrote:
| OP mentions the failure rate of GPUs as "If this were in GPU
| land, it would have failed within the first few days for
| sure.".
|
| In my humble opinion, we have never had GPU failures even for
| large-scale training. Our current training batch job is a 20GB
| JSON file which takes 6 hours just to load and has been running
| for more than 15 days without a hiccup. And we are using the
| older Tesla T4.
|
| GPUs have memory constraint issues, but if you can plan and work
| around them, I haven't seen them crash in real life.
| teaearlgraycold wrote:
| Ha! We're also committing great sins of computation against
| T4s at our company. Hopefully, as I learn, things get less
| janky.
| gwern wrote:
| > And we are using the older Tesla T4.
|
| That's an undemanding and well-debugged chip by this point (it
| launched about 6 years ago!). So you aren't experiencing any of the pain
| people using A100s or H100s (never mind people who have to
| stand up clusters with B100s soon) are going through now.
| shrubble wrote:
| Have you checked if there is a faster way to parse your JSON?
| 3Gbytes/hour to load a file seems slow on today's CPUs...
| flybarrel wrote:
| What would be an ideal (or more appropriate) speed?
| shrubble wrote:
| Well it would depend on the specifics of the JSON file
| but eyeballing the stats at
| https://github.com/miloyip/nativejson-benchmark/tree/master
| seems to indicate that even on a 2015 MacBook the parsing
| proceeds using e.g. the Configuru parser at several megabytes
| per second.
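| One way to sidestep the 6-hour load entirely (a rough sketch,
| assuming the file is, or can be converted to, JSON Lines with
| one record per line) is to stream and batch records lazily
| instead of parsing one 20GB document up front:
|
|   import json
|
|   def iter_records(path):
|       # parse one record at a time instead of the whole file
|       with open(path, "r", encoding="utf-8") as f:
|           for line in f:
|               line = line.strip()
|               if line:
|                   yield json.loads(line)
|
|   def batches(path, batch_size=1024):
|       # lazily group records so memory use stays flat
|       batch = []
|       for rec in iter_records(path):
|           batch.append(rec)
|           if len(batch) == batch_size:
|               yield batch
|               batch = []
|       if batch:
|           yield batch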
| nl wrote:
| > 20GB json file... takes 6 hours just to load
|
| Err you definitely should be doing something about that.
|
| 20GB on T4s (how many?) isn't really comparable to terabytes
| on thousands of A100s.
| lambersley wrote:
| Agreed. It reads like Seven of Nine realizing she's separated
| from the Collective and needs to rely on lowly human
| capabilities. The insights into vendors were informative.
| flybarrel wrote:
| Newbie question - what happens when an LLM training job
| experiences a hardware failure? I don't suppose you lose all the
| training progress, do you? Then the pain is mostly in diagnosing
| the problem and getting the cluster running again, but no need
| to worry about data loss, right?
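| (My rough understanding, sketched below with illustrative
| PyTorch-style names, is that runs checkpoint periodically, so a
| failure costs only the steps since the last save plus the time
| to diagnose and restart - please correct me if that's wrong.)
|
|   import os
|   import torch
|
|   CKPT = "checkpoint.pt"   # hypothetical path
|
|   def save_checkpoint(step, model, optimizer):
|       # called every N steps during training
|       torch.save({"step": step,
|                   "model": model.state_dict(),
|                   "optimizer": optimizer.state_dict()}, CKPT)
|
|   def maybe_resume(model, optimizer):
|       # on restart after a failure, reload and continue
|       if os.path.exists(CKPT):
|           state = torch.load(CKPT, map_location="cpu")
|           model.load_state_dict(state["model"])
|           optimizer.load_state_dict(state["optimizer"])
|           return state["step"]
|       return 0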
| yalok wrote:
| > All in all, this is only a small part of the story of how we
| started a company, raised some money, bought some chips and
| matched Gemini pro/GPT 3.5 and outperformed many others in less
| than a year having to build everything from scratch.
|
| I wonder what the budget for the chips/cloud GPUs was to achieve
| a GPT-3.5-level LLM - at least to an order of magnitude - $2-5
| million?
| joe_the_user wrote:
| So essentially a startup in this context has a small number of
| people and a large amount of money for training clusters. The
| article describes the operational details of leasing servers -
| which you'd assume applies to many startups (or existing firms).
|
| So it seems like you have the various LLM creators all doing
| roughly the same sort of thing (training with text and image
| data) with similar hardware and similar data. Each of these
| naturally has their own brand of "secret sauce" for
| distinguishing their venture. The various secret sauces can make
| a difference in the quality of an LLM's output.
|
| Yet overall, this seems like a massive, energy intensive exercise
| in redundancy.
| dauertewigkeit wrote:
| I don't think most of them have any kind of secret sauce. I
| think the founders hope to get bought out simply for being able
| to train "near-SOTA" LLMs. I guess achieving that level of
| skill and infra could be valuable enough to build upon.
| joe_the_user wrote:
| Sure, that's also a factor but I'd say it reinforces my main
| point.
| DeepChill wrote:
| Good point, so the only real differentiator would be the
| size & quality of the data being fed and the fine tuning
| done on the model? I wonder what else differentiates LLMs
| from each other
| Iulioh wrote:
| Alignment and censorship ?
| pests wrote:
| Alignment just means making it do what you want. LLMs
| just continue the sequence; the chat question-and-response
| style we have now is an example of alignment (to what
| humans want).
| eru wrote:
| Alignment can mean making sure your LLM doesn't continue
| the sequence in embarrassing ways, eg by spouting
| politically incorrect sequences of words (even though
| those might have been common in the training data).
| friendzis wrote:
| In what way does this do more good than harm?
| eru wrote:
| In the sense of people caring about their models not
| saying embarrassing things?
|
| Different people have different goals, and they don't
| necessarily align with yours.
| llm_trw wrote:
| Also getting a golden ticket.
|
| Goliath 120B is still the best open source model and no
| one knows why, since it's just two Llama 2 70Bs glued
| together.
| doctorpangloss wrote:
| Maybe it's simpler than that. Instead of spending money on
| compute that costs X and that cloud providers charge 20*X for,
| they could spend the money creating training data, but that
| story is way too hard to tell to investors.
| llm_trw wrote:
| >Yet overall, this seems like a massive, energy intensive
| exercise in redundancy.
|
| Keep in mind that this is also chaff to distract people from
| the real secret sauce. I imagine that just as many startups are
| hiring writers and photographers to create extremely well
| labelled uncontaminated data for training.
|
| One only needs to look at the perverts over at Civitai to see
| how far you can go with intensive labeling on a tiny compute
| budget.
| fennecbutt wrote:
| Us furries were properly tagging data on e6 for a long time
| before LLMs came about.
| PeterStuer wrote:
| "this seems like a massive, energy intensive exercise in
| redundancy"
|
| This is commonly referred to as a market working as intended.
| Yes, the waste from this type of redundancy can be _massive_ ,
| especially if you realize that ultimately just a tiny
| percentage of these efforts will result in even moderate
| success. But it is the price to pay at the edge of progress. A
| planned monopoly might be more efficient (despite popular
| banter that just compares a megacorp or a gov, which is
| basically the same, to a single successful startup, ignoring
| the 999 that tried and failed), but those seldom beat a market
| on innovation.
| polygamous_bat wrote:
| > This is commonly refered to as a market working as
| intended.
|
| Is it? It seems like the market is unable to separate the
| wheat from the chaff and is just throwing money around hoping
| to hit the jackpot. While AI has a massive chance of affecting
| our lives, the investment market paints a pretty similar
| picture to what happened during the crypto boom.
| PeterStuer wrote:
| Our inability to predict future success from failure is
| exactly why we have (massively inefficient) markets
| outcompeting centralized planned approaches.
| manquer wrote:
| is it any different from evolution?
| samus wrote:
| There are not that many of these startups actually. Most use
| cases of LLM can be backed with a fine-tune of an off-the-shelf
| foundation model. If you're training foundation models from
| scratch, you're entering a difficult-to-monetize market where
| the big boys could eat your lunch by just releasing a new
| foundation model that might be able to do more than 95% of what
| yours does.
| twelfthnight wrote:
| > To be very frank, I would have to say the quality of codebases
| externally significantly lag behind those I've been used to at
| Google
|
| Haven't worked at Google, does anyone else share this sentiment?
| I always feel like Google code is typically not idiomatic and
| super difficult to go "under the hood" with if anything isn't
| precisely on the happy path.
| winwang wrote:
| (not googler)
|
| Google's codebase is idiomatic to Google due to their strict
| language tooling. e.g. their C++ code stays away from advanced
| features. The tooling teams at Google have very strong say.
| twelfthnight wrote:
| I get that sense too. Probably does work awesome if you're
| inside. But man it's a mess when they externalize stuff. Just
| one example: their cloud platform CLI includes an entire
| python installation and takes 1.7G on disk, just to make API
| calls...
| jen20 wrote:
| I have never understood why cloud providers seem to think
| it is OK to write their CLIs in Python. The AWS one is too,
| and the Azure one went from Node.js to Python some time
| ago.
| anonymous-panda wrote:
| Packaging and stability reasons. Same for why it's a
| 1.7GB install - probably where they landed after having
| tons of support issues on some random Python version they
| didn't test, or some issue with a dependency. Freezing the
| entire set of artifacts is more stable, and Python lets you
| move pretty quick. I can't speak to why Node.js vs Python
| though - maybe Python is easier to embed?
| pests wrote:
| What? They only get packaging and stability because they
| include the runtime. If they just went with a compiled
| language they could distribute native binaries and have
| actual packaging and stability.
| anonymous-panda wrote:
| Yes, but it's not just a single metric. Another is how
| easy it is for them to hire productive members of the
| team and how much that costs them - middling Python
| developers churning out fine-ish code are cheaper than
| Rust developers doing the same. It's hard to find a
| language where you can be as productive as a developer in
| Python that also has AOT compilation to generate
| standalone binaries.
|
| Tldr: there's multiple factors to consider here and it's
| more interesting to understand the pressures that cause
| the decisions, especially if you want to try to create a
| world where different decisions are made.
| jen20 wrote:
| > It's hard to find a language where you can be as
| productive as a developer in Python that also has AOT
| compilation to generate standalone binaries.
|
| Outside specific cases around machine learning, it's
| really not: Go is that language. It's not like each of
| those platforms doesn't have to have a similar team that
| understand Go anyway (for their SDK), so they could save
| their customers the abject pain of Python dependency
| management by just writing their CLIs using it.
| twelfthnight wrote:
| Yeah, I imagine that was the decision calculus. "Instead
| of spending some more effort to save millions of
| unnecessary downloads of python's runtime using a
| different language, let's just bundle Python!"
|
| I wouldn't be surprised if it was version 2.7 too...
| jen20 wrote:
| Of course, writing them in Go would solve all of these
| problems while producing packages which are much smaller.
| twelfthnight wrote:
| There probably is a sense in which the APIs are
| constantly changing, so maybe an interpreted language
| might make sense? I imagine there has to be a better way
| to do this with Go or Rust though (even Lua?) for a
| smaller binary.
| candiodari wrote:
| Google python binaries are more akin to docker or even vm
| images, even if the actual technology used predates
| docker and even linux VMs. They contain something like a
| slimmed-down linux distribution, not just a binary.
|
| EXTREME predictability (e.g. never ever using the
| system's libssl), in trade for huge binaries. They go
| pretty damn far in this: you won't catch a Google binary
| even using most of libc.
| jyap wrote:
| It makes "sense" based on the domain of the cloud
| provider being DevOps teams who are maintaining and using
| these CLI tools, i.e. what they use day to day.
|
| For anything more advanced they offer language-specific
| SDKs in Rust, Swift, Kotlin, etc...
|
| For example integrating storage in an iOS app.
| marcyb5st wrote:
| Did you install all the components? Because if so, you also
| installed emulators for Pub/Sub and Bigtable (maybe
| others, I don't remember), which explains the big footprint.
| dheera wrote:
| > e.g. their C++ code stays away from advanced features
|
| Which honestly is a GOOD thing because it would make it much
| easier for newcomers to ramp up on existing codebases. Most
| people aren't used to working with spaceships and constexprs.
|
| Readability is also far more valuable to a large team than
| efficiency for anything that isn't a number-crunching loop.
| renegade-otter wrote:
| "Externally", no one could possibly beat Google's track record
| of not committing to products before finally killing them. But
| the code was beautiful, though!
| twelfthnight wrote:
| I mean, was Angular ever "beautiful"?
| resource0x wrote:
| Pretty sure it was. A lousy idea might still be implemented
| beautifully under the hood. :-)
| titanomachy wrote:
| I thought the quality was pretty high, largely because there
| were a lot of rails constraining how code should be written.
| Most of the code I dealt with was written using somewhat rigid
| (but generally well-designed) frameworks with programmatically-
| enforced style guides.
|
| Also, most work seemed to involve some balance of junior and
| more experienced people, which helped keep quality higher.
| Outside of Google, I've seen pretty large projects written by
| new grads with little supervision (and on a tight timeline).
| Those codebases can be pretty hairy.
| twelfthnight wrote:
| That honestly does seem like a recipe for good code. And
| sure, there's tons of open source out there of dubious
| quality.
|
| @resource0x in a sibling comment made the point that it's
| possible to write great code even if the program is a flawed
| design. I'm probably conflating those things.
| rokkitmensch wrote:
| The thing that impressed me most about Google was the
| encoding-of-cultural-norms-in-various-CI-jobs.
|
| It lets them extract usable SWE horsepower from pretty much
| anyone who steps inside and at least tries to be useful and
| not just coast. They can ingest a startup engineer, someone
| who's been a mid-tier enterprise codemonkey, yr mythical
| 10xer, the whole statistical gamut.
| danans wrote:
| > Haven't worked at Google, anyone else share this sentiment?
|
| I worked there, and the quality is definitely much higher and
| the code tends to be far more maintainable. However, there is
| often a cost for that, which is velocity.
|
| Some of this is reduced by the sheer amount of automation in
| tooling (i.e. bots that block style violations and common bugs
| before a code change is submitted).
|
| In other cases, it slows things down quite a bit.
| ein0p wrote:
| A recent ex-googler here: quality of Google3 in general is
| pretty good, but the LLM training bits are so abysmal that I
| know people who have resigned instead of working on it. And
| it's also extra slow because getting a couple local GPUs is not
| really an option. So you're forced to "develop in Colab" which
| works for some things and not for others and in general sucks
| ass if you're working on anything substantial. For anything
| more substantial you'll be launching stuff on some resource
| pool, waiting for like 10-15 minutes until it starts (much
| longer for large models), and then trying to divine why it
| failed from voluminous and sometimes indecipherable crash logs
| which also hang your browser when cluster UI tries to load
| them.
|
| Rumors of Google's AI code superiority are vastly overblown in
| 2024. I'm currently at another major AI lab, and the code here
| can actually be understood and worked on, which I consider to
| be a massive advantage.
| alsoworkedthere wrote:
| Finally, an accurate portrayal!
|
| Google has superb robustness and code quality, with garbage-
| level usability. Once you're set up, you can kick off many
| massive training jobs and compare results easily. However,
| getting to that point is really hard. You'll never figure out
| how to use the ML infrastructure and libraries on your own.
| You can only get it to work by meeting with the teams that
| wrote the infra so they can find and fix every error and
| misconfiguration. Usually, there is one single way to get
| things working together, and neither the documentation nor
| the error messages will get you to that brittle state.
|
| It's near impossible to get a VM with a TPU or GPU attached,
| so there's no way to debug issues that happen between the
| library and the accelerator. Plus somehow they've made Python
| take longer to build (??!!) and run than C++ takes, so your
| iteration cycle is several minutes for what would take
| seconds at any other place. Fun stuff! Somehow it's still one
| of the best places to do ML work, but they sure try to make
| it as difficult as possible.
| ein0p wrote:
| Google doesn't use VMs internally to run workloads. But
| yeah, seconds-long dev iteration cycles take minutes or
| even tens of minutes there.
| bo1024 wrote:
| This is very interesting, but I really want to hear about the
| training data process!
| stealthcat wrote:
| They should list the technical debt accumulated so far and
| rank it. At this stage, lots of corners have been cut.
| LZ_Khan wrote:
| I wish I knew how to do yolo runs.
|
| - signed, a compute resource hog at FAANG
| planet_y wrote:
| I'm wondering if the title should read "from the ground up"
| instead of "ground zero"?
| https://en.wikipedia.org/wiki/Hypocenter
| zer00eyz wrote:
| https://www.merriam-webster.com/dictionary/ground%20zero
|
| It is a perfectly acceptable use of the idiom.
| davidmurdoch wrote:
| Acceptable, but maybe not perfectly.
| dotancohen wrote:
| Yes, the title sounds like somebody confused two idioms. That's
| not the type of author from whom I want to learn.
| frozenseven wrote:
| 1. As others have pointed out, it's a perfectly valid idiom.
| Check a dictionary.
|
| 2. How do you think idioms are created in the first place?
|
| 3. What exactly forces you to act like this?
| makoto12 wrote:
| Could be intentional, implying LLMs are a proverbial nuclear
| bomb to the tech landscape. But honestly it threw me as well.
| julianh65 wrote:
| So which compute providers have folks had a good experience with?
| hackerlight wrote:
| > In the end it took us only a very small number of smaller scale
| & shorter ablation runs to get to the strong 21B Reka Flash and
| 7B edge model (and also our upcoming largest core model). Finding
| a solid recipe with a very limited number of runs is challenging
| and requires changing many variables at once given the
| ridiculously enormous search space. In order to do this, one has
| to abandon the systematicity of Bigtech and rely a lot on "Yolo",
| gut feeling and instinct.
|
| > Thankfully, I (and many of us in the team) have built up this
| intuition quite a bit in our ML careers to get it right within a
| substantially short amount of tries. While we've trained really
| good models before in our previous jobs, differences in training
| infrastructure, data, incorporation of new ideas and other
| environmental issues can still cause non-trivial differences in
| outcomes. That said, a strong prior helps to significantly cut
| down the search space and is probably one of the easiest
| explanations to why we were able to train really strong models
| with so few trials, resources and experimentation.
| a_bonobo wrote:
| But what is the product they're selling?
|
| The main Reka.AI page looks like a regular ChatGPT clone, an LLM
| you pay for by the token. How is this different from all these
| other companies? Pricing seems to be comparable to ChatGPT
| 3.5-Turbo.
| polygamous_bat wrote:
| Perhaps a cure for venture capitalist FOMO for not having
| invested in AI?
| classified wrote:
| Absorbing the risk of copyright and license violations en masse
| for the training data as a service?
| TrackerFF wrote:
| Big question is, how do small startups manage to get funding for
| LLM products if they don't have the "correct" background /
| pedigree?
|
| The world of LLM startups is beginning to look like the world of
| hedge funds and private equity firms - where the prerequisites
| for seed/funding are:
|
| A) Prestigious employment history / correct pedigree.
|
| B) Solid network of investors ready to jump before any product
| has even begun.
| rvz wrote:
| Then what happens when the LLM or AI performs worse than
| expected? Spend more money fine-tuning?
|
| By the time you get it all working, not only have you spent lots
| of your VC capital on training alone, your competitors (Google,
| Meta, etc.) have already released a more powerful model, much
| better and quicker than you, before you could run your second
| training epoch.
|
| Another example of a startup incinerating VC money in a pump and
| dump scheme for vaporware AI snake oil.
| tkgally wrote:
| I learned about reka.ai from this post; their LLMs don't seem to
| have been discussed much on HN yet [1]. So, out of curiosity, I
| spent the last hour testing prompts with their chat interface [2]
| in comparison with ChatGPT 4, Gemini Advanced, Claude 3, and
| Mistral Large. I put the results at [3]. Overall, Reka Flash
| doesn't seem significantly worse or better than the others. A lot
| more testing would be necessary to be sure, of course.
|
| [1]
| https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
|
| [2] https://chat.reka.ai/chat
|
| [3] https://gally.net/temp/20240307llmcomparison.html
| egberts1 wrote:
| TL;DR: LLM training is highly susceptible to GIGO.
|
| (GIGO is what one gets when feeding an LLM with "G"arbage "I"n:
| "G"arbage "O"ut.)
|
| This is the same problem as making a vaccine signature that fits
| like a glove ... as tight as possible ... when populating the
| anti-malware (i.e. IDS/IPS/NDS/XNS) search-pattern engine for
| use by Aho-Corasick-variant algorithms (such as Parallel
| Failureless Aho-Corasick).
|
| However, an LLM as a binary-code-based detector for malware
| detection has very limited benefit (it is there, but only as a
| backend topical add-on after all other conditionals have been
| identified).
|
| LLMs lack qualifying conditionals surrounding premise data, and
| I have my doubts about using LLMs for medical diagnosis as well,
| until we start having LLMs denote the much-needed weighted
| combo-conditionals as "percentages".
___________________________________________________________________
(page generated 2024-03-07 23:02 UTC)