[HN Gopher] Training LLMs from ground zero as a startup
       ___________________________________________________________________
        
       Training LLMs from ground zero as a startup
        
       Author : swyx
       Score  : 100 points
       Date   : 2024-03-05 22:31 UTC (1 days ago)
        
 (HTM) web link (www.yitay.net)
 (TXT) w3m dump (www.yitay.net)
        
       | swyx wrote:
       | for context Yi Tay was Tech Lead on Google PaLM, UL2, Flan, Bard,
        | etc and now is cofounder at Reka (which has shipped some very
        | interesting small multimodal models that have featured on here).
        | I prompted him for this post since he's an ex-Googler now
        | training LLMs at an independent startup
       | https://twitter.com/YiTayML/status/1765105066263052718
       | 
       | our conversation was recorded here
       | https://sub.thursdai.news/p/thursdai-feb-15-2024-openai-chan...
        
         | swyx wrote:
          | (update: i submitted this yesterday and it didn't get traction,
          | i guess @dang must've merged the old submission in here. you
          | really didn't have to, but it's a nice gesture. thanks dang!!)
        
           | axpy906 wrote:
            | Great to see you on here. Love the Latent Space podcast.
        
       | pama wrote:
        | Training LLMs from scratch is a super important issue that affects
       | the pace and breadth of iteration of AI almost as much as the raw
       | hardware improvements do. The blog is fun but somewhat shallow
       | and not technical or very surprising if you've worked with
       | clusters of GPUs in any capacity over the years. (I liked the
       | perspective of a former googler, but I'm not sure why past
       | colleagues would recommend Jax over pytorch for LLMs outside of
       | Google.) I hope this newco eventually releases a more technical
       | report about their training adventures, like the PDF file here:
       | https://github.com/facebookresearch/metaseq/tree/main/projec...
        
         | axpy906 wrote:
          | If you're doing research, JAX makes some sense. Probably some
         | Google bias in there too.
        
           | lyapunova wrote:
           | To be honest, most researchers in applied ML in the bay say
           | the opposite. If you are trying to be nimble and prototype,
           | use pytorch. If you're trying to gain some optimizations as
           | you near deployment, rewrite in Jax.
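            | 
            | A minimal sketch of that trade-off (assuming torch and jax
            | are installed; the toy matmul step is mine, not anything
            | from the post):
            | 
            |     import torch
            |     import jax
            |     import jax.numpy as jnp
            | 
            |     # PyTorch: eager execution, easy to poke at line by
            |     # line while prototyping.
            |     w = torch.randn(512, 512, requires_grad=True)
            |     x = torch.randn(8, 512)
            |     loss = (x @ w).relu().sum()
            |     loss.backward()  # gradients available to inspect
            | 
            |     # JAX: the same step as a pure function, jit-compiled
            |     # by XLA, which is where the deployment-time speed
            |     # tends to come from.
            |     def loss_fn(w, x):
            |         return jnp.sum(jax.nn.relu(x @ w))
            | 
            |     grad_fn = jax.jit(jax.grad(loss_fn))
            |     g = grad_fn(jnp.asarray(w.detach().numpy()),
            |                 jnp.asarray(x.numpy()))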
        
       | abeppu wrote:
       | It's worth taking a second to note that the author just assumes
       | that readers understand "the wilderness" to mean "not Google".
       | 
       | This post gives a lot of credit to Google's infra and hardware
       | teams, and I'd love to read a perspective from one of those
       | insiders who then went on to do related work elsewhere.
        
       | yalok wrote:
       | > All in all, this is only a small part of the story of how we
       | started a company, raised some money, bought some chips and
       | matched Gemini pro/GPT 3.5 and outperformed many others in less
       | than a year having to build everything from scratch.
       | 
        | I wonder what budget was spent on the chips/cloud GPUs to
        | achieve a GPT 3.5 level LLM - at least the order of magnitude -
        | $2-5 million?
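        | 
        | A very rough way to sanity-check the order of magnitude (every
        | number below is an assumption I'm plugging in, not something
        | from the post):
        | 
        |     # Rule of thumb: training FLOPs ~ 6 * params * tokens.
        |     params = 20e9    # assume a ~20B-parameter model
        |     tokens = 2e12    # assume ~2T training tokens
        |     flops = 6 * params * tokens
        | 
        |     # Assume ~1.25e14 sustained FLOP/s per A100-class GPU
        |     # (roughly 40% utilization) at ~$2/GPU-hour; both vary
        |     # a lot in practice.
        |     gpu_hours = flops / 1.25e14 / 3600
        |     print(f"{gpu_hours:,.0f} GPU-hours")         # ~533,000
        |     print(f"${gpu_hours * 2:,.0f} for one run")  # ~$1.1M
        | 
        | One training run like that lands around a million dollars, and
        | real budgets are several times that once you count ablations,
        | failed runs and evaluation, so low single-digit millions seems
        | plausible.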
        
       | joe_the_user wrote:
       | So essentially a startup in this context has a small number of
       | people and a large amount of money for training clusters. The
        | article describes much of the operation of leasing servers -
        | which you'd assume applies to many startups (or existing firms).
       | 
       | So it seems like you have the various LLM creators all doing
       | roughly the same sort of thing (training with text and image
       | data) with similar hardware and similar data. Each of these
       | naturally has their own brand of "secret sauce" for
       | distinguishing their venture. The various secret sauces can make
       | a difference in the quality of an LLM's output.
       | 
        | Yet overall, this seems like a massive, energy-intensive exercise
       | in redundancy.
        
         | dauertewigkeit wrote:
         | I don't think most of them have any kind of secret sauce. I
         | think the founders hope to get bought out simply for being able
         | to train "near-SOTA" LLMs. I guess achieving that level of
         | skill and infra could be valuable enough to build upon.
        
           | joe_the_user wrote:
           | Sure, that's also a factor but I'd say it reinforces my main
           | point.
        
       | twelfthnight wrote:
       | > To be very frank, I would have to say the quality of codebases
       | externally significantly lag behind those I've been used to at
       | Google
       | 
        | Haven't worked at Google - does anyone else share this
        | sentiment? I always feel like Google code is typically not
        | idiomatic and super difficult to go "under the hood" with if
        | anything isn't precisely on the happy path.
        
         | winwang wrote:
         | (not googler)
         | 
          | Google's codebase is idiomatic to Google due to their strict
          | language tooling, e.g. their C++ code stays away from advanced
          | features. The tooling teams at Google have a very strong say.
        
           | twelfthnight wrote:
           | I get that sense too. Probably does work awesome if you're
           | inside. But man it's a mess when they externalize stuff. Just
              | one example: their cloud platform CLI includes an entire
              | Python installation and takes 1.7 GB on disk, just to make
              | API calls...
        
             | jen20 wrote:
             | I have never understood why cloud providers seem to think
             | it is OK to write their CLIs in Python. The AWS one is too,
             | and the Azure one went from Node.js to Python some time
             | ago.
        
               | anonymous-panda wrote:
                | Packaging and stability reasons. Same for why it's a
                | 1.7 GB install - probably where they landed after having
                | tons of support issues on some random Python version they
                | didn't test, or with a dependency that broke. Freezing
                | the entire set of artifacts is more stable, and Python
                | lets you move pretty quickly. I can't speak to why
                | Node.js vs Python though - maybe Python is easier to
                | embed?
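                | 
                | As a sketch of the "freeze everything" approach (the
                | tool choice here is mine; I don't know what the cloud
                | vendors actually use internally):
                | 
                |     # PyInstaller bundles CPython itself with the
                |     # script, which is why such artifacts get large
                |     # but stop depending on whatever Python the user
                |     # happens to have installed.
                |     import PyInstaller.__main__
                | 
                |     PyInstaller.__main__.run([
                |         "mycli.py",   # hypothetical entry point
                |         "--onefile",  # one self-contained executable
                |     ])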
        
               | pests wrote:
                | What? They only get packaging and stability because they
               | include the runtime. If they just went with a compiled
               | language they could distribute native binaries and have
               | actual packaging and stability.
        
               | anonymous-panda wrote:
               | Yes, but it's not just a single metric. Another is how
               | easy it is for them to hire productive members of the
               | team and how much that costs them - middling Python
                | developers churning out fine-ish code are cheaper than
                | Rust developers doing the same. It's hard to find a
                | language where a developer can be as productive as in
                | Python and that also has AOT compilation to generate
                | standalone binaries.
               | 
                | TL;DR: there are multiple factors to consider here, and
                | it's more interesting to understand the pressures that
                | cause the decisions, especially if you want to try to
                | create a world where different decisions are made.
        
               | twelfthnight wrote:
               | Yeah, I imagine that was the decision calculus. "Instead
                | of spending some more effort with a different language
                | to save millions of unnecessary downloads of Python's
                | runtime, let's just bundle Python!"
               | 
               | I wouldn't be surprised if it was version 2.7 too...
        
               | twelfthnight wrote:
                | There probably is a sense in which the APIs are
                | constantly changing, so maybe an interpreted language
                | might make sense? I imagine there has to be a better way
                | to do it with Go or Rust though (even Lua?) for a
                | smaller binary.
        
               | jyap wrote:
               | It makes "sense" based on the domain of the cloud
               | provider being DevOps teams who are maintaining and using
               | these CLI tools. Ie. What they use day to day.
               | 
               | For anything more advanced they offer language specific
               | SDKs in Rust, Swift, Kolton, etc...
               | 
               | For example integrating storage in an iOS app.
        
             | marcyb5st wrote:
              | Did you install all the components? Because if so you also
              | installed emulators for Pub/Sub and Bigtable (maybe
              | others, I don't remember), which explains the large
              | footprint.
        
           | dheera wrote:
           | > e.g. their C++ code stays away from advanced features
           | 
            | Which honestly is a GOOD thing, because it makes it much
           | easier for newcomers to ramp up on existing codebases. Most
           | people aren't used to working with spaceships and constexprs.
           | 
           | Readability is also far more valuable to a large team than
           | efficiency for anything that isn't a number-crunching loop.
        
         | renegade-otter wrote:
         | "Externally", no one could possibly beat Google's track record
         | of not committing to products before finally killing them. But
         | the code was beautiful, though!
        
           | twelfthnight wrote:
           | I mean, was Angular ever "beautiful"?
        
             | resource0x wrote:
             | Pretty sure it was. A lousy idea might still be implemented
             | beautifully under the hood. :-)
        
         | titanomachy wrote:
         | I thought the quality was pretty high, largely because there
         | were a lot of rails constraining how code should be written.
         | Most of the code I dealt with was written using somewhat rigid
         | (but generally well-designed) frameworks with programmatically-
         | enforced style guides.
         | 
         | Also, most work seemed to involve some balance of junior and
         | more experienced people, which helped keep quality higher.
         | Outside of Google, I've seen pretty large projects written by
         | new grads with little supervision (and on a tight timeline).
         | Those codebases can be pretty hairy.
        
           | twelfthnight wrote:
           | That honestly does seem like a recipe for good code. And
           | sure, there's tons of open source out there of dubious
           | quality.
           | 
           | @resource0x in a sibling comment made the point that it's
           | possible to write great code even if the program is a flawed
           | design. I'm probably conflating those things.
        
         | danans wrote:
         | > Haven't worked at Google, anyone else share this sentiment?
         | 
         | I worked there, and the quality is definitely much higher and
         | the code tends to be far more maintainable. However, there is
         | often a cost for that, which is velocity.
         | 
         | Some of this is reduced by the sheer amount of automation in
         | tooling (i.e. bots that block style violations and common bugs
         | before a code change is submitted).
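          | 
          | The general shape of such a check, as a rough sketch (the
          | real internal tooling is much more elaborate; black and
          | the hook wiring here are my stand-ins):
          | 
          |     #!/usr/bin/env python3
          |     # Pre-commit hook: refuse the commit if the formatter
          |     # would change any staged Python file.
          |     import subprocess, sys
          | 
          |     files = subprocess.run(
          |         ["git", "diff", "--cached", "--name-only",
          |          "--", "*.py"],
          |         capture_output=True, text=True,
          |         check=True).stdout.split()
          | 
          |     if files and subprocess.run(
          |             ["black", "--check", *files]).returncode:
          |         sys.exit("style check failed; run black first")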
         | 
         | In other cases, it slows things down quite a bit.
        
       | bo1024 wrote:
       | This is very interesting, but I really want to hear about the
       | training data process!
        
       ___________________________________________________________________
       (page generated 2024-03-06 23:00 UTC)