[HN Gopher] Reproducing GPT-2 in llm.c
       ___________________________________________________________________
        
       Reproducing GPT-2 in llm.c
        
       Author : tosh
       Score  : 310 points
       Date   : 2024-05-28 15:58 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | indigodaddy wrote:
        | Looks like this is re: training, but I wonder how inference on
        | this model would go on some garbage older machine with no GPU?
        
         | ryankrage77 wrote:
          | Last time I tried GPT-2 on CPU (which I think was shortly
          | before ChatGPT launched), I was getting about 0.2 tokens/sec.
          | CPU utilization was low though, so running inference in
          | parallel gave better results. I was using 2x E5-2660s.
        
           | int_19h wrote:
           | DDR5 helps a lot. You can actually run stuff like LLaMA at >1
           | tok/s on the CPU with high-end gaming hardware these days.
        
       | benterix wrote:
        | I just hope that in a couple of years we'll see a submission here
       | titled "Reproduce GPT-4 on legacy RTX 4090."
       | 
       | Because currently even with open source (?) models we are still
       | consumers, and the training is still the domain of the rich.
        
         | ravetcofx wrote:
          | Accessing the dataset to train from scratch will be the biggest
          | hurdle, now that a lot of the pile has had the ladder pulled up
          | since GPT-4.
        
           | CamperBob2 wrote:
           | Someone will come along and say "Why don't you just mirror
           | Anna's Archive?" in 3...2...1...
        
           | exe34 wrote:
            | I suppose you wouldn't be able to use it for external
           | services, but internally, I'm sure you can find some books
           | that fell off the back of a truck...
        
             | HeatrayEnjoyer wrote:
              | No reason you can't go external. GPT was trained using
              | ebook torrent sites.
        
               | artninja1988 wrote:
               | OpenAI has enough money to hire lawyers to defend it
               | until the end of time though
        
           | meiraleal wrote:
           | I'm okay with paying for datasets
        
             | CamperBob2 wrote:
             | Depends on how the courts rule. If the copyright
             | maximalists prevail, only the wealthiest entities will be
             | able to afford to license a useful data set.
             | 
             | Paradoxically enough, this is the outcome that most "Hacker
             | News" denizens seem to be rooting for.
        
               | meiraleal wrote:
                | I'd still get most of my dataset from torrents, but I
                | could pay for specific things like high-quality source
                | code.
        
               | groby_b wrote:
               | It's almost as if people believe in fairness and
               | compensating people for their work.
               | 
               | Also, it's worth noting that this is only true as long as
               | we're stuck in the "must train on the entire sum total of
               | human output ever created" local minimum for machine
               | learning. Given that most biological entities learn with
               | much less data, this might well be the thing that prods
                | ML research toward an approach that isn't "IDK, buy a
               | few containers of GPUs, and half a DC of storage, see if
               | that makes things better".
        
           | GaggiX wrote:
            | https://huggingface.co/datasets/HuggingFaceFW/fineweb has 15T
            | tokens of cleaned and deduplicated English web data.
        
             | ravetcofx wrote:
              | Holy crap. Does Hugging Face charge for bandwidth if you're
              | downloading 45 terabytes?
        
               | drexlspivey wrote:
                | I believe they are hosting it on Cloudflare, which
                | doesn't charge for egress.
        
               | fragmede wrote:
                | More specifically, Cloudflare R2 doesn't charge for
                | egress, and Cloudflare doesn't charge for egress to
                | members of the Bandwidth Alliance, which includes Azure,
                | Google Cloud, Oracle, Alibaba Cloud, and others, though
                | critically not AWS.
               | 
               | They very much do charge egress fees elsewhere.
        
         | vineyardmike wrote:
          | We won't ever get there, nor need to, because GPT-4 wasn't
          | trained on one GPU; it was trained on thousands. The (most
          | likely) biggest meaningful differences between -2 and -4 are
          | the number of parameters and the training data/duration. I
          | don't think you'd really learn much more.
        
         | Invictus0 wrote:
         | I'm not saying this to be rude, but I think you have a deep
         | misunderstanding of how AI training works. You cannot just skip
         | the matrix multiplications necessary to train the model, or get
         | current hardware to do it faster.
        
           | xdavidliu wrote:
            | Was the first sentence really necessary? The second sentence
           | seems fine by itself.
        
         | auspiv wrote:
         | Considering it takes 8x A100 GPUs (80GB VRAM) to train GPT-2, I
         | think it'll take far more than a single 4090.
        
           | bufo wrote:
            | The RTX 4090 has about the same BF16 Tensor Core TOPS as the
            | A100. Assuming 50% MFU (like the A100 40 GB PCIe), it would
            | take 8x longer on 1 RTX 4090 than on 8x A100 80GB SXM, so
            | about 12 hours. Datasheet for the TOPS:
            | https://images.nvidia.com/aem-
            | dam/Solutions/geforce/ada/nvid... 50% MFU should be
            | achievable on the 4090.
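            | 
            | Rough back-of-envelope in Python (my own assumptions, working
            | forward from the 8x / ~12 hour numbers above):
            | 
            |     # assumed baseline: ~1.5 h for the GPT-2 (124M) run on an
            |     # 8x A100 80GB box (i.e. the 12 h figure divided by 8)
            |     baseline_hours = 1.5
            |     # one RTX 4090 with roughly A100-class BF16 TOPS at ~50%
            |     # MFU has ~1/8 the box's throughput -> ~8x the wall clock
            |     single_4090_hours = baseline_hours * 8
            |     print(single_4090_hours)  # ~12 hours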
        
         | anthonix1 wrote:
         | FWIW, I'm seeing ~318,000 toks/sec throughput on a 4x AMD 7900
         | XTX machine (less than $4k worth of GPU), using the same
         | settings as in the post (0.5M batch size etc).
        
           | pama wrote:
           | Did you reproduce the evaluation as well?
        
         | sabareesh wrote:
         | Well here is a comment on 4090
         | https://github.com/karpathy/llm.c/discussions/481#discussion...
        
       | karpathy wrote:
        | Hi HN, the main (more detailed) article is here:
       | https://github.com/karpathy/llm.c/discussions/481
       | 
       | Happy to answer questions!
        
         | 1024core wrote:
         | Thank you, from an appreciative reader!
        
         | ngiyabonga wrote:
         | Hi Andrej!
         | 
          | First, thank you for your teaching; it has helped me a lot. I
          | didn't think I'd ever have the chance to say thank you, but
          | here you are and I hope this gets to you!
          | 
          | Question - what's a relevant (05-2024) baseline to compare the
          | performance of the C code to? Back when you made nanoGPT you
          | were seeing "the file train.py reproduces GPT-2 (124M) on
          | OpenWebText, running on a single 8XA100 40GB node in about 4
          | days of training". So twice the memory on the C node, but I'm
          | unsure of the data size/epochs and any other details I may be
          | missing. I.e. what's the net uplift of running C vs "legacy"
          | torch code?
         | 
         | Thanks again for everything.
        
           | karpathy wrote:
           | The baseline is definitely PyTorch (or JAX), and indeed
           | something like nanoGPT. I just never got nanoGPT "past the
           | finish line" of really crossing the t's and dotting the i's
           | and reproducing the models with as much care as I did now and
           | here in llm.c, and getting to the point where it's a single
           | launch command that just does the thing.
           | 
           | I think I'll try to develop the `train_gpt2.py` inside llm.c
           | to be that, so that we have the two implementations exactly
           | side by side, and it's all nice and comparable.
           | 
           | The C/CUDA code is currently a little bit faster than PyTorch
           | (last time I measured ~2 weeks ago it was about 6% faster),
           | and I think we can push this further. This is done by
           | manually hard-coding a bunch of fusions/optimizations that
            | are non-trivial for torch.compile to find (e.g. our
            | FusedClassifier; see the sketch below). But PyTorch has some
            | pending work/PRs that will also speed up their side a lot.
           | 
           | Ultimately my interest in llm.c is to have a nice, clean,
           | minimal, super dependency-light repo in direct C/CUDA
           | implementation, which I find aesthetically pleasing. And on
           | top of that, educational, i.e. using all of the above as an
           | endpoint of an intro LLM course.
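            | 
            | To make the fusion point concrete, here is a rough numpy
            | sketch (not the actual CUDA kernel) of the kind of thing a
            | fused classifier collapses: loss and dlogits come out of a
            | single pass over the logits, using the identity dlogits =
            | softmax(logits) - one_hot(targets), instead of materializing
            | softmax, loss and backward as separate ops:
            | 
            |     import numpy as np
            | 
            |     def fused_softmax_xent(logits, targets):
            |         # logits: (N, V) floats, targets: (N,) class indices
            |         z = logits - logits.max(axis=1, keepdims=True)
            |         probs = np.exp(z)
            |         probs /= probs.sum(axis=1, keepdims=True)
            |         n = np.arange(len(targets))
            |         loss = -np.log(probs[n, targets]).mean()
            |         dlogits = probs              # reuse the probs buffer
            |         dlogits[n, targets] -= 1.0   # softmax - one_hot
            |         dlogits /= len(targets)      # mean reduction
            |         return loss, dlogits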
        
             | ilaksh wrote:
             | Just out of curiosity, how do you feel about Tinygrad? They
             | just released 0.9 and are also on the HN home page today.
        
             | raymond_goo wrote:
             | Maybe talk to MasterClass...
        
         | sturza wrote:
         | Do you think grokking leads to proper generalized reasoning?
         | https://arxiv.org/abs/2405.15071
        
           | bilsbie wrote:
           | Any tips on understanding grokking? I'm not following that
           | paper.
        
             | sturza wrote:
              | That's the title of the original paper: "Grokking:
              | Generalization Beyond Overfitting on Small Algorithmic
              | Datasets". Roughly: keep training well past the point of
              | overfitting, and new generalization behavior might emerge.
        
         | espadrine wrote:
         | How big of a perf improvement would result from using the
         | architectural tweaks that Llama3 and others have put in place
         | since GPT-2?
        
           | karpathy wrote:
            | My understanding and suspicion is that it's mostly less than
            | you think. The Llama 3 architecture has the following changes
            | relative to GPT-2:
           | 
           | 1. delete the absolute positional encoding and replace with
           | RoPE
           | 
           | 2. delete all biases in all layers (in LayerNorms, they turn
           | into RMSNorm)
           | 
           | 3. GeLU -> SwiGLU non-linearity in the MLP
           | 
           | 4. longer context length
           | 
           | 5. architecture hyperparameter changes, e.g. slightly
           | different aspect ratios
           | 
            | And there was a paper that I can't find the reference to
            | anymore that claimed that if you train long enough, the gap
            | becomes even smaller. Possibly because the absolute
            | positional encoding has enough time to train more fully,
            | whereas the RoPE layer benefits from the "inductive bias" it
            | adds in the earlier stages of training.
            | 
            | But I don't have full confidence in the above claim; maybe
            | someone has tried it or has a better/concrete reference.
        
             | jorlow wrote:
              | Note Llama's feed-forward is a bit different too:
             | self.w2(F.silu(self.w1(x)) * self.w3(x))
             | 
             | I.e. the nonlinearity is a gate.
             | 
             | https://github.com/meta-
             | llama/llama3/blob/14aab0428d3ec3a959...
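              | 
              | Spelled out as a module (a minimal PyTorch sketch of that
              | gated MLP, with my own naming, not the actual llama3 code):
              | 
              |     import torch.nn as nn
              |     import torch.nn.functional as F
              | 
              |     class SwiGLUMLP(nn.Module):
              |         def __init__(self, dim, hidden_dim):
              |             super().__init__()
              |             # no biases anywhere, per point 2 above
              |             self.w1 = nn.Linear(dim, hidden_dim, bias=False)
              |             self.w3 = nn.Linear(dim, hidden_dim, bias=False)
              |             self.w2 = nn.Linear(hidden_dim, dim, bias=False)
              | 
              |         def forward(self, x):
              |             # one up-projection passes through SiLU, the
              |             # other multiplies it elementwise (the gate),
              |             # vs GPT-2's plain fc -> GeLU -> proj with biases
              |             return self.w2(F.silu(self.w1(x)) * self.w3(x))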
        
         | lagrange77 wrote:
          | Thank you for the effort you put into your educational work; it
          | has helped me and others a lot! In fact, I'm training my
          | nanoGPT version right now. :)
         | 
         | > Ultimately my interest in llm.c is to have a nice, clean,
         | minimal, super dependency-light repo in direct C/CUDA
         | implementation, which I find aesthetically pleasing.
         | 
         | Also, it's awesome that you spend your time on your passion.
         | 
         | Any plans on making a video series on llm.c? :D
        
           | karpathy wrote:
           | Yes definitely. Related tweet of mine:
           | 
           | https://x.com/karpathy/status/1760388761349927356?lang=en
           | 
           | 1. Build the thing
           | 
           | 2. Build the ramp
           | 
            | Currently on step 1 :). It helps to build it first so you
            | know where you are going, and then you can more easily
            | rebuild it when your vector is pointed at the end result.
        
             | lagrange77 wrote:
             | That's fantastic. My gradient field is pointing towards it.
             | 
             | Thank you again!
        
             | htrp wrote:
              | Every time you take gardening leave, you build something
              | new and interesting!
        
         | 363849473754 wrote:
         | You might have covered this topic before, but I'm curious about
         | the main performance differences between nanoGPT and llm.c. I'm
         | planning to take your "Zero to Hero" course, and I'd like to
         | know how capable the nanoGPT chatbot you'll build is. Is its
         | quality comparable to GPT-2 when used as a chatbot?
        
           | karpathy wrote:
            | Zero To Hero doesn't make it all the way to a chatbot; it
            | stops at pretraining, and even that only at a fairly small
            | scale, with a character-level transformer on TinyShakespeare.
            | I think it's a good conceptual intro but you don't get too
            | too far as a competent chatbot. I think I should be able to
            | improve on this soon.
        
             | 363849473754 wrote:
             | Thanks! So, you are considering expanding the Zero to Hero
             | series to include building a basic GPT-2 toy chatbot? I
             | believe you mentioned in one of the early lectures that you
              | planned to include building a toy version of DALL-E. Do you
             | still have plans for that as well?
        
             | maskil wrote:
             | Please do! It's a fantastic series!
        
         | dang wrote:
         | Ok, we've changed the URL to that from
         | https://twitter.com/karpathy/status/1795484547267834137 above.
         | Thanks!
        
           | karpathy wrote:
            | Sounds good. Both work, (though) I think HN has a bit of an
            | anti-Twitter bias.
        
             | pests wrote:
             | First, love the videos and other work you've been doing.
             | The micrograd videos are a great way to show people this is
             | all math in the end, and I've linked to specific timestamps
             | in that video and others more times than I can count.
             | 
              | As for why I think we have an anti-Twitter bias...
             | 
             | Twitter doesn't show replies or any further context without
             | being logged in. Most people will have accounts but I know
             | a lot here deleted theirs or refuse to use it for one
             | reason or another.
             | 
             | Also IMO most here are going to want to read the full
             | source so it just cuts out the middleman. This would
             | usually fall under the "Please submit the original source.
             | If a post reports on something found on another site,
              | submit the latter." guideline, which is a little different
              | here since the source is yourself, but still, the Twitter
              | post doesn't add anything new.
        
         | m11a wrote:
         | Why write in CUDA and not just use PyTorch etc?
         | 
          | If it's for performance, how much faster is it, out of curiosity?
        
           | kgwgk wrote:
           | > Why write in CUDA and not just use PyTorch etc?
           | 
           | "LLM training in simple, pure C/CUDA. There is no need for
           | 245MB of PyTorch or 107MB of cPython. [...] A few more words
           | on what I want this repo to be: First, I want llm.c to be a
           | place for education."
        
         | simonw wrote:
         | > Keep in mind that here we trained for 10B tokens, while GPT-3
         | models were all trained for 300B tokens. [...] GPT-3 actually
         | didn't change too much at all about the model (context size
         | 1024 -> 2048, I think that's it?).
         | 
         | Andrej, based on that do you have a rough cost estimate for
         | what it would take to train a GPT-3 Ada (350M)? Do you plan to
          | get there with llm.c?
        
           | karpathy wrote:
           | The 350M model I trained last night was 30B tokens, 14 hours,
           | ~$200. Conveniently, 300B is exactly 10X the tokens so ~$2K
           | would be the estimate. You'd have to wait 140 hours on one
           | box though. Getting an H100 box instead of A100 will already
            | cut the time down, probably by a factor of 2-3X, for
           | free, even without going to fp8 (which we do plan to
           | support).
           | 
            | So TL;DR: at this model scale, llm.c is already there
            | functionally, I think; it's a matter of the compute resources and
           | and patience. I currently have this one box from Lambda and I
           | have to look around for a few more boxes and merge the
           | pending PR for multi-node training support. Getting all of
           | this into a nice, stable state is probably a good chunk of
           | the pending work right now.
        
         | localhost wrote:
          | How large is the set of binaries needed to do this training
          | job? The current PyTorch + CUDA ecosystem is so incredibly
          | gigantic, and manipulating those container images is painful
          | because they are so large. I was hopeful that this would be the
          | beginning of a much smaller training/fine-tuning stack?
        
           | karpathy wrote:
           | That is 100% my intention and hope and I think we are very
           | close to deleting all of that. Right now on master, I am
            | already only using Python for the tokenization preprocessing
            | (rough sketch of the idea below). In principle the
            | requirements for llm.c should be extremely minimal. I think
            | this is a few days of work that is high on my mind.
           | 
           | Biggest problem right now is finding a place that can host
           | the 135GB of tokens for FineWeb100B. Will probably use S3 or
           | something.
           | 
           | Related see: https://github.com/karpathy/llm.c/issues/482
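            | 
            | For reference, the tokenization preprocessing mentioned above
            | is roughly this shape (a simplified sketch with tiktoken +
            | numpy; the exact on-disk format llm.c expects is defined by
            | the data scripts in the repo, this is just the general idea):
            | 
            |     # minimal sketch: GPT-2 BPE -> flat uint16 token file
            |     import numpy as np
            |     import tiktoken
            | 
            |     enc = tiktoken.get_encoding("gpt2")
            |     eot = enc.encode("<|endoftext|>", allowed_special="all")[0]
            | 
            |     def write_tokens(docs, path):
            |         ids = []
            |         for doc in docs:
            |             ids.append(eot)            # delimit documents
            |             ids.extend(enc.encode(doc))
            |         # the GPT-2 vocab (50257) fits in uint16
            |         np.array(ids, dtype=np.uint16).tofile(path)
            | 
            |     write_tokens(["hello world"], "tiny.bin")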
        
       | notg963 wrote:
        | Do you have plans to create videos for llm.c?
        
       | anoy8888 wrote:
        | Can it be done in Rust?
        
       | celltalk wrote:
       | Time for llm videos!
        
       | natsucks wrote:
        | In your opinion, is it important for ML engineers to know C?
        
         | brcmthrowaway wrote:
         | 0% chance
        
         | esafak wrote:
          | You'd have to be deep into ML infrastructure to use C, probably
          | via CUDA. No one who develops or uses ML models touches C or
          | even C++; tinygrad and llama.cpp are the exceptions.
        
       ___________________________________________________________________
       (page generated 2024-05-28 23:00 UTC)