[HN Gopher] Reproducing GPT-2 in llm.c
___________________________________________________________________
Reproducing GPT-2 in llm.c
Author : tosh
Score : 310 points
Date : 2024-05-28 15:58 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| indigodaddy wrote:
| Looks like this is about training, but I wonder how inference
| on this model would go on some garbage older machine with no
| GPU?
| ryankrage77 wrote:
| Last time I tried GPT-2 on CPU (which I think was shortly
| before ChatGPT was launched), I was getting about 0.2
| tokens/sec. CPU utilization was low though, so running
| inference in parallel gave better results. I was using 2 x
| E5-2660s.
| int_19h wrote:
| DDR5 helps a lot. You can actually run stuff like LLaMA at >1
| tok/s on the CPU with high-end gaming hardware these days.
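| A rough way to see why DDR5 helps: single-stream CPU decoding
| is mostly memory-bandwidth bound, since every generated token
| streams all the weights once. A back-of-the-envelope sketch
| with assumed, illustrative numbers (not a benchmark):
|
|     mem_bw_gb_s = 80   # assumed dual-channel DDR5 bandwidth
|     weights_gb = 14    # assumed ~7B model held in fp16
|     print(mem_bw_gb_s / weights_gb, "tok/s ceiling")  # ~5.7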
| benterix wrote:
| I just hope that in a couple of years we'll see a submission
| here titled "Reproduce GPT-4 on a legacy RTX 4090."
|
| Because currently, even with open source (?) models, we are
| still consumers, and the training is still the domain of the
| rich.
| ravetcofx wrote:
| Accessing the dataset to train from scratch will be the
| biggest hurdle; a lot of the Pile has had the ladder pulled
| up since GPT-4.
| CamperBob2 wrote:
| Someone will come along and say "Why don't you just mirror
| Anna's Archive?" in 3...2...1...
| exe34 wrote:
| I suppose you wouldn't be able to use it for external
| services, but internally, I'm sure you can find some books
| that fell off the back of a truck...
| HeatrayEnjoyer wrote:
| No reason you can't go external. GPT was trained using
| ebook torrent sites.
| artninja1988 wrote:
| OpenAI has enough money to hire lawyers to defend it
| until the end of time though
| meiraleal wrote:
| I'm okay with paying for datasets
| CamperBob2 wrote:
| Depends on how the courts rule. If the copyright
| maximalists prevail, only the wealthiest entities will be
| able to afford to license a useful data set.
|
| Paradoxically enough, this is the outcome that most "Hacker
| News" denizens seem to be rooting for.
| meiraleal wrote:
| I'd still get most of my dataset from torrent but I could
| pay for specific things like high quality source code.
| groby_b wrote:
| It's almost as if people believe in fairness and
| compensating people for their work.
|
| Also, it's worth noting that this is only true as long as
| we're stuck in the "must train on the entire sum total of
| human output ever created" local minimum for machine
| learning. Given that most biological entities learn with
| much less data, this might well be the thing that prods
| ML research toward an approach that isn't "IDK, buy a
| few containers of GPUs and half a DC of storage, and see
| if that makes things better".
| GaggiX wrote:
| https://huggingface.co/datasets/HuggingFaceFW/fineweb has 15T
| tokens of cleaned and deduplicated English web data.
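| If you don't want to download all of it up front, it can also
| be streamed; a minimal sketch with the Hugging Face datasets
| library (assuming the default config and that each record
| carries a "text" field):
|
|     from datasets import load_dataset  # pip install datasets
|
|     fw = load_dataset("HuggingFaceFW/fineweb", split="train",
|                       streaming=True)
|     for i, doc in enumerate(fw):
|         print(doc["text"][:80])
|         if i == 2:
|             break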
| ravetcofx wrote:
| Holy crap, does Hugging Face charge for bandwidth if you're
| downloading 45 terabytes??
| drexlspivey wrote:
| I believe they are hosting it on Cloudflare, which doesn't
| charge for egress.
| fragmede wrote:
| More specifically, Cloudflare R2 doesn't charge for
| egress, and Cloudflare doesn't charge for egress to
| members of the Bandwidth Alliance, which includes Azure,
| Google Cloud, Oracle, Alibaba Cloud, and others, though
| critically not AWS.
|
| They very much do charge egress fees elsewhere.
| vineyardmike wrote:
| We won't ever get there, and don't need to, because GPT-4
| wasn't trained on one GPU; it was trained on thousands. The
| (most likely) biggest meaningful differences between -2 and
| -4 are the number of parameters and the training
| data/duration. I don't think you'd really learn much more.
| Invictus0 wrote:
| I'm not saying this to be rude, but I think you have a deep
| misunderstanding of how AI training works. You cannot just skip
| the matrix multiplications necessary to train the model, or get
| current hardware to do it faster.
| xdavidliu wrote:
| Was the first sentence really necessary? The second sentence
| seems fine by itself.
| auspiv wrote:
| Considering it takes 8x A100 GPUs (80GB VRAM) to train GPT-2, I
| think it'll take far more than a single 4090.
| bufo wrote:
| The RTX 4090 has about the same BF16 Tensor Core TOPS as the
| A100. Assuming 50% MFU (like the A100 40 GB PCIe), it would
| take 8x longer on 1 RTX 4090 vs 8x A100 80GB SXM, so ~12
| hours. Datasheet for the TOPS here:
| https://images.nvidia.com/aem-
| dam/Solutions/geforce/ada/nvid... 50% MFU should be
| achievable on the 4090.
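| The arithmetic behind that estimate, as a sketch (assuming
| the ~90-minute 8x A100 run from the post and roughly equal
| per-GPU BF16 throughput and MFU on the 4090):
|
|     hours_on_8x_a100 = 1.5   # ~90 min for GPT-2 (124M)
|     num_gpus = 8
|     # one 4090 with ~A100-class per-GPU throughput:
|     print(hours_on_8x_a100 * num_gpus, "hours")  # ~12 hours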
| anthonix1 wrote:
| FWIW, I'm seeing ~318,000 toks/sec throughput on a 4x AMD 7900
| XTX machine (less than $4k worth of GPU), using the same
| settings as in the post (0.5M batch size etc).
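| For scale, at that throughput the 10B-token run from the post
| is just arithmetic away:
|
|     toks_per_sec = 318_000
|     total_tokens = 10e9
|     print(total_tokens / toks_per_sec / 3600, "hours")  # ~8.7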
| pama wrote:
| Did you reproduce the evaluation as well?
| sabareesh wrote:
| Well, here is a comment on the 4090:
| https://github.com/karpathy/llm.c/discussions/481#discussion...
| karpathy wrote:
| Hi HN the main (more detailed) article is here
| https://github.com/karpathy/llm.c/discussions/481
|
| Happy to answer questions!
| 1024core wrote:
| Thank you, from an appreciative reader!
| ngiyabonga wrote:
| Hi Andrej!
|
| First, thank you for your teaching; it has helped me a lot. I
| didn't think I'd ever have the chance to say thank you, but
| here you are, and I hope this gets to you!
|
| Question - what's a relevant (05-2024) baseline to compare
| the performance of the C code to? Back when you made nanoGPT
| you were seeing "the file train.py reproduces GPT-2 (124M) on
| OpenWebText, running on a single 8XA100 40GB node in about 4
| days of training". So twice the memory on the C node, but I'm
| unsure of the data size/epochs and any other details I may be
| missing. I.e., what's the net uplift of running C vs "legacy"
| torch code?
|
| Thanks again for everything.
| karpathy wrote:
| The baseline is definitely PyTorch (or JAX), and indeed
| something like nanoGPT. I just never got nanoGPT "past the
| finish line" of really crossing the t's and dotting the i's
| and reproducing the models with as much care as I did now and
| here in llm.c, and getting to the point where it's a single
| launch command that just does the thing.
|
| I think I'll try to develop the `train_gpt2.py` inside llm.c
| to be that, so that we have the two implementations exactly
| side by side, and it's all nice and comparable.
|
| The C/CUDA code is currently a little bit faster than PyTorch
| (last time I measured ~2 weeks ago it was about 6% faster),
| and I think we can push this further. This is done by
| manually hard-coding a bunch of fusions/optimizations that
| are non-trivial for torch.compile to find (e.g. our
| FusedClassifier). But PyTorch has some pending work/PRs that
| will also speed up their side a lot.
|
| Ultimately my interest in llm.c is to have a nice, clean,
| minimal, super dependency-light repo in direct C/CUDA
| implementation, which I find aesthetically pleasing. And on
| top of that, educational, i.e. using all of the above as an
| endpoint of an intro LLM course.
| ilaksh wrote:
| Just out of curiosity, how do you feel about Tinygrad? They
| just released 0.9 and are also on the HN home page today.
| raymond_goo wrote:
| Maybe talk to MasterClass...
| sturza wrote:
| Do you think grokking leads to proper generalized reasoning?
| https://arxiv.org/abs/2405.15071
| bilsbie wrote:
| Any tips on understanding grokking? I'm not following that
| paper.
| sturza wrote:
| Grokking: Generalization Beyond Overfitting on Small
| Algorithmic Datasets. Roughly: keep training well past the
| point of overfitting, and new generalization behavior can
| suddenly emerge.
| espadrine wrote:
| How big of a perf improvement would result from using the
| architectural tweaks that Llama3 and others have put in place
| since GPT-2?
| karpathy wrote:
| My understanding and suspicion is: mostly less than you'd
| think. The Llama 3 architecture makes the following changes
| relative to GPT-2:
|
| 1. delete the absolute positional encoding and replace with
| RoPE
|
| 2. delete all biases in all layers (in LayerNorms, they turn
| into RMSNorm)
|
| 3. GeLU -> SwiGLU non-linearity in the MLP
|
| 4. longer context length
|
| 5. architecture hyperparameter changes, e.g. slightly
| different aspect ratios
|
| And there was a paper (I can't find the reference anymore)
| that claimed that if you train long enough, the gap becomes
| even smaller. Possibly because the absolute positional
| encoding has enough time to train more fully, whereas the
| RoPE layer benefits from the "inductive bias" it adds in the
| earlier stages of training.
|
| But I don't have full confidence in the above claim; maybe
| someone has tried it or has a better/concrete reference.
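| For concreteness on change (2), this is roughly what the
| LayerNorm -> RMSNorm swap looks like in PyTorch (an
| illustrative sketch, not the actual llm.c or Llama code):
|
|     import torch
|     import torch.nn as nn
|
|     class RMSNorm(nn.Module):
|         # LayerNorm without mean-centering and without a bias
|         def __init__(self, dim, eps=1e-5):
|             super().__init__()
|             self.eps = eps
|             self.weight = nn.Parameter(torch.ones(dim))
|
|         def forward(self, x):
|             rms = torch.sqrt(
|                 x.pow(2).mean(-1, keepdim=True) + self.eps)
|             return self.weight * (x / rms)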
| jorlow wrote:
| Note Llama's feed forward is a bit different too:
| self.w2(F.silu(self.w1(x)) * self.w3(x))
|
| I.e. the nonlinearity is a gate.
|
| https://github.com/meta-
| llama/llama3/blob/14aab0428d3ec3a959...
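| Expanded into a full module, that line is roughly the
| following (a sketch; the layer names follow the snippet
| above, while "GatedMLP" and the hidden size are placeholder
| choices, not Meta's code):
|
|     import torch.nn as nn
|     import torch.nn.functional as F
|
|     class GatedMLP(nn.Module):
|         def __init__(self, dim, hidden_dim):
|             super().__init__()
|             self.w1 = nn.Linear(dim, hidden_dim, bias=False)
|             self.w3 = nn.Linear(dim, hidden_dim, bias=False)
|             self.w2 = nn.Linear(hidden_dim, dim, bias=False)
|
|         def forward(self, x):
|             # silu(w1(x)) acts as a gate on the w3(x) branch
|             return self.w2(F.silu(self.w1(x)) * self.w3(x))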
| lagrange77 wrote:
| Thank you for the effort you put into your educational work;
| it has helped me and others a lot! In fact, I'm training my
| nanoGPT version right now. :)
|
| > Ultimately my interest in llm.c is to have a nice, clean,
| minimal, super dependency-light repo in direct C/CUDA
| implementation, which I find aesthetically pleasing.
|
| Also, it's awesome that you spend your time on your passion.
|
| Any plans on making a video series on llm.c? :D
| karpathy wrote:
| Yes definitely. Related tweet of mine:
|
| https://x.com/karpathy/status/1760388761349927356?lang=en
|
| 1. Build the thing
|
| 2. Build the ramp
|
| Currently on step 1 :). It helps to build it first so you
| know where you are going, and then you can more easily
| re-build it once your vector is pointed at the end result.
| lagrange77 wrote:
| That's fantastic. My gradient field is pointing towards it.
|
| Thank you again!
| htrp wrote:
| Every time you take gardening leave, you build something new
| and interesting!
| 363849473754 wrote:
| You might have covered this topic before, but I'm curious about
| the main performance differences between nanoGPT and llm.c. I'm
| planning to take your "Zero to Hero" course, and I'd like to
| know how capable the nanoGPT chatbot you'll build is. Is its
| quality comparable to GPT-2 when used as a chatbot?
| karpathy wrote:
| Zero To Hero doesn't make it all the way to a chatbot; it
| stops at pretraining, and even that at a fairly small scale,
| with a character-level transformer on TinyShakespeare. I
| think it's a good conceptual intro, but you don't get too far
| toward a competent chatbot. I think I should be able to
| improve on this soon.
| 363849473754 wrote:
| Thanks! So, you are considering expanding the Zero to Hero
| series to include building a basic GPT-2 toy chatbot? I
| believe you mentioned in one of the early lectures that you
| planned to include building a toy version of DALL-E. Do you
| still have plans for that as well?
| maskil wrote:
| Please do! It's a fantastic series!
| dang wrote:
| Ok, we've changed the URL to that from
| https://twitter.com/karpathy/status/1795484547267834137 above.
| Thanks!
| karpathy wrote:
| Sounds good. Both work, though I think HN has a bit of an
| anti-Twitter bias.
| pests wrote:
| First, love the videos and other work you've been doing.
| The micrograd videos are a great way to show people this is
| all math in the end, and I've linked to specific timestamps
| in that video and others more times than I can count.
|
| For why I think we have an anti-Twitter bias...
|
| Twitter doesn't show replies or any further context without
| being logged in. Most people will have accounts, but I know
| a lot of people here have deleted theirs or refuse to use it
| for one reason or another.
|
| Also, IMO most people here are going to want to read the full
| source, so it just cuts out the middleman. This would usually
| fall under the "Please submit the original source. If a post
| reports on something found on another site, submit the
| latter." guideline, which is a little different here since
| the source is yourself, but the Twitter post still doesn't
| add anything new or novel.
| m11a wrote:
| Why write in CUDA and not just use PyTorch etc.?
|
| If it's for performance, how much faster is it, out of
| curiosity?
| kgwgk wrote:
| > Why write in CUDA and not just use PyTorch etc?
|
| "LLM training in simple, pure C/CUDA. There is no need for
| 245MB of PyTorch or 107MB of cPython. [...] A few more words
| on what I want this repo to be: First, I want llm.c to be a
| place for education."
| simonw wrote:
| > Keep in mind that here we trained for 10B tokens, while GPT-3
| models were all trained for 300B tokens. [...] GPT-3 actually
| didn't change too much at all about the model (context size
| 1024 -> 2048, I think that's it?).
|
| Andrej, based on that, do you have a rough cost estimate for
| what it would take to train a GPT-3 Ada (350M)? Do you plan
| to get there with llm.c?
| karpathy wrote:
| The 350M model I trained last night was 30B tokens, 14 hours,
| ~$200. Conveniently, 300B is exactly 10X the tokens, so ~$2K
| would be the estimate. You'd have to wait 140 hours on one
| box though. Getting an H100 box instead of A100 will already
| cut the latency down, probably by a factor of 2-3X, for free,
| even without going to fp8 (which we do plan to support).
|
| So TL;DR: at this model scale, llm.c is already there
| functionally, I think; it's a matter of compute resources and
| patience. I currently have this one box from Lambda, and I
| have to look around for a few more boxes and merge the
| pending PR for multi-node training support. Getting all of
| this into a nice, stable state is probably a good chunk of
| the pending work right now.
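| Spelled out, the extrapolation is linear in tokens (a sketch
| using the numbers above):
|
|     tokens_run, hours_run, cost_run = 30e9, 14, 200
|     scale = 300e9 / tokens_run             # = 10x
|     print(hours_run * scale, "hours,",     # ~140 hours
|           cost_run * scale, "dollars")     # ~$2K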
| localhost wrote:
| How large is the set of binaries needed to do this training
| job? The current PyTorch + CUDA ecosystem is so incredibly
| gigantic, and manipulating those container images is painful
| because they are so large. I was hoping this would be the
| beginning of a much smaller training/fine-tuning stack?
| karpathy wrote:
| That is 100% my intention and hope, and I think we are very
| close to deleting all of that. Right now on master, I am
| already only using Python for the tokenization preprocessing.
| In principle the requirements for llm.c should be extremely
| minimal. I think this is a few days of work that is high on
| my mind.
|
| Biggest problem right now is finding a place that can host
| the 135GB of tokens for FineWeb100B. Will probably use S3 or
| something.
|
| Related see: https://github.com/karpathy/llm.c/issues/482
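| For anyone curious what "only using Python for the
| tokenization preprocessing" looks like, the idea is roughly
| this (an illustrative sketch, not the actual prepro script;
| "input.txt" and "tokens.bin" are placeholder names):
|
|     import numpy as np
|     import tiktoken
|
|     enc = tiktoken.get_encoding("gpt2")
|     with open("input.txt") as f:
|         tokens = enc.encode_ordinary(f.read())
|     # GPT-2's 50257-token vocab fits in uint16, so the C side
|     # can read the flat binary file of token ids directly.
|     np.array(tokens, dtype=np.uint16).tofile("tokens.bin")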
| notg963 wrote:
| Do you have plans to create videos for llm.c?
| anoy8888 wrote:
| Can it be done in Rust?
| celltalk wrote:
| Time for llm videos!
| natsucks wrote:
| In your opinion is it important for ML engineers to know C?
| brcmthrowaway wrote:
| 0% chance
| esafak wrote:
| You'd have to be deep into ML infrastructure to use C,
| probably via CUDA. No one who develops or uses ML models
| touches C or even C++; tinygrad and llama.cpp are the
| exceptions.
___________________________________________________________________
(page generated 2024-05-28 23:00 UTC)