https://github.com/karpathy/llm.c/discussions/481

karpathy / llm.c · Discussions · General

Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20 #481

karpathy (Maintainer) · May 28, 2024 · 9 comments, 19 replies

Let's reproduce GPT-2 (124M) in llm.c (~4,000 lines of C/CUDA) in 90 minutes for $20. The 124M model is the smallest model in the GPT-2 series released by OpenAI in 2019, and it is actually quite accessible today, even for the GPU poor. With llm.c, which is quite efficient at up to ~60% model flops utilization, reproducing this model on one 8X A100 80GB SXM node takes ~90 minutes. For example, on Lambda this node goes for ~$14/hr, so the total cost of reproducing this model today is about $20. You can train the model with a single GPU too; it would just take proportionally longer (e.g. ~4-24 hours depending on the GPU). In addition, llm.c still has a lot of pending optimizations and people haven't tried to tune the training in the style of cramming, so I'd say we're likely to see significant improvements on this number. So here is the run, training the 12-layer, 12-head, 768-dimension, 124M Transformer on 10 billion tokens of FineWeb:

(Figure: chart124M — left: validation loss on the withheld FineWeb split; right: HellaSwag accuracy.)

The left pane shows that we outperform the checkpoint released by OpenAI on the FineWeb withheld validation dataset.
This is not the ideal metric, because the data distribution of GPT-2 was different (it was trained on the never-released "WebText" dataset) and the statistics of the internet may have been different 5 years ago, so it's not a super fair comparison. Therefore, in addition, on the right we also plot the HellaSwag accuracy, a benchmark commonly used to assess LLM capability that is nice, smooth, and well-behaved. I'd mostly look at HellaSwag, but FineWeb val is a nice confirmation. That said, HellaSwag has no math/code, so it slightly favors our setting (common crawl-like data). One more point of reference: the GPT-3 paper, in Appendix H, cites a HellaSwag accuracy of 33.7 for the GPT-3 Small (124M) model. We get to 29.9 here, which surpasses GPT-2 (124M) at 29.4. Keep in mind that here we trained for 10B tokens, while the GPT-3 models were all trained for 300B tokens.

Now here is the shortest path to reproducing this result yourself. You'll need a GPU. I like and run my work on Lambda Labs (who graciously sponsor llm.c development), though the inventory can be limited at times. Many other providers exist and you can use the Discussion below for tips and tricks around this. Here is the example process for Linux x86 64-bit Ubuntu 22.04 with CUDA 12 (this is somewhere around the current, default "modern" configuration). If you're on a different system, the comments and discussion in the main README file might be helpful.

```bash
# install miniconda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
source ~/.bashrc

# pytorch nightly (optional) https://pytorch.org/get-started/locally/
# conda install --yes pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia

# pip installs so we can tokenize the FineWeb dataset
yes | pip install tqdm tiktoken requests datasets

# install cudnn so we can use FlashAttention and run fast (optional)
# https://developer.nvidia.com/cudnn-downloads
# for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04
wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-ubuntu2204-9.1.1_1.0-1_amd64.deb
sudo dpkg -i cudnn-local-repo-ubuntu2204-9.1.1_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cudnn-cuda-12

# "install" cudnn-frontend to ~/
git clone https://github.com/NVIDIA/cudnn-frontend.git

# install MPI (optional, if you intend to use multiple GPUs)
sudo apt install openmpi-bin openmpi-doc libopenmpi-dev

# tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?)
# writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
# and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb
git clone https://github.com/karpathy/llm.c.git
cd llm.c
python dev/data/fineweb.py --version 10B

# compile llm.c (mixed precision, with cuDNN flash-attention)
# first compilation is ~1 minute, mostly due to cuDNN
make train_gpt2cu USE_CUDNN=1

# train on a single GPU
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1

# if you have multiple GPUs (e.g. 8), simply prepend the mpi command, e.g.:
# mpirun -np 8 ./train_gpt2cu \ ...   (the rest of the args are the same)
```
Args guide. A lot of these hyperparameters follow the GPT-3 paper rather than the GPT-2 paper, because it was a lot more detailed. Args explanation:

* -i, -j are the training and validation split token files, written by fineweb.py.
* -o is the output directory to write logs and checkpoints into.
* -e "d12" asks to initialize a depth-12 GPT-2 model from scratch.
* -b 64 sets the micro-batch size to 64. If you are running out of memory, decrease this value, e.g. try 32, 16, 8, all the way down to 1 potentially.
* -t 1024 sets the maximum sequence length to 1024, as GPT-2 did.
* -d 524288 requests that the total batch size per single update be ~0.5M tokens. The code will take this desired batch size and calculate the needed gradient accumulation "inner loop" steps of the optimization. For example, on 8 GPUs at -b 64 and -t 1024, every micro-step is doing exactly 8 x 64 x 1024 = 524,288 tokens, so there is no need for gradient accumulation. But if we only have 1 GPU, the code will set gradient accumulation to 8 and do an inner loop of 8 iterations to add up to this "total batch size" per step (see the short sketch after this list). While the batch size used to train GPT-2 is unknown, this number of ~0.5M comes from the GPT-3 paper table, for this model size.
* -r 1 sets the recompute setting to 1, so we re-compute the GeLU activations. This slightly increases the runtime, but saves quite a bit of memory, allowing us to increase the batch size and get a net increase in token throughput.
* -z 1 turns on ZeRO-1 (i.e. optimizer state sharding) across multiple GPUs. If you're training with > 1 GPU, this setting is a no-brainer and should basically always be on. On 1 GPU it is a no-op.
* -c 0.1 sets the weight decay to 0.1. Only (2D) weights are decayed, exactly as in GPT-2, and this number comes from the GPT-3 paper.
* -l 0.0006 sets the maximum learning rate, from the GPT-3 paper.
* -q 0.0 says that we will decay the learning rate to 0 over the course of training.
* -u 700 says that we will ramp up the learning rate from 0 to the maximum learning rate over the first 700 iterations, which at a total batch size of 0.5M is 350M tokens, following the GPT-3 paper.
* -n 5000 asks to save model checkpoints every 5000 steps.
* -v 250 asks to evaluate and log the validation loss every 250 steps.
* -s 20000 asks to sample some tokens every 20000 steps. Because the total number of steps will be less than this (see below), this basically turns generation off and we will only sample a single time at the very end.
* -h 1 asks to evaluate the HellaSwag accuracy, something we can compare across papers.
* Because we did not set the maximum number of steps with the -x flag, it defaults to exactly one epoch over the training data, i.e. 10B tokens. Because the total batch size is ~0.5M and the total number of tokens is 10B, there will be a total of ~10B/0.5M = 20K steps.

There's a lot of detail above, but the TLDR is that we're training a 12-layer GPT-2 (124M), from scratch, on 10B tokens of FineWeb, with a maximum sequence length of 1024 tokens. If you are running out of memory, I would first make sure you have -r 1 turned on, and then start decreasing the micro-batch size -b by dividing it by 2 until the run fits. Once it runs, see if you can get away with going back to -r 0 to recover a little bit of speed.
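Here is the sketch referenced above — a rough illustration (not the llm.c code itself) of how -d, -b, -t, and the GPU count determine the number of gradient accumulation steps:

```python
# Rough sketch (not llm.c code): how -d (total desired batch size, in tokens)
# turns into gradient accumulation steps, given micro-batch size -b,
# sequence length -t, and the number of GPUs/processes.
def grad_accum_steps(total_batch_size: int, B: int, T: int, num_gpus: int) -> int:
    tokens_per_micro_step = B * T * num_gpus
    assert total_batch_size % tokens_per_micro_step == 0, "should divide evenly"
    return total_batch_size // tokens_per_micro_step

print(grad_accum_steps(524288, B=64, T=1024, num_gpus=8))  # -> 1, no accumulation needed
print(grad_accum_steps(524288, B=64, T=1024, num_gpus=1))  # -> 8, inner loop of 8 micro-steps
```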
Training. The code will print something like this over time (this example is from a single A100 40GB PCIe GPU, $1.29/hr):

```
step 80/18865 | train loss 7.577051 | norm 1.1461 | lr 6.86e-05 | 2950.68 ms | 49.0% A100 fp16 MFU | 177968 tok/s
step 81/18865 | train loss 7.540626 | norm 1.4001 | lr 6.94e-05 | 2952.59 ms | 49.0% A100 fp16 MFU | 177948 tok/s
step 82/18865 | train loss 7.465753 | norm 1.0613 | lr 7.03e-05 | 2953.98 ms | 48.9% A100 fp16 MFU | 177924 tok/s
step 83/18865 | train loss 7.472681 | norm 1.1553 | lr 7.11e-05 | 2955.67 ms | 48.9% A100 fp16 MFU | 177897 tok/s
```

What is going on? Well, we have 10B training tokens and our batch size is ~0.5M, so we'd expect about 10B/0.5M ~= 20K steps in total. It actually works out to exactly 18,865 because one of the data shards is reserved for validation data and the exact batch size is a nice power of 2 at 524,288. So here we are on step 80/18865, which took 2950.68 ms. MFU is short for "Model Flops Utilization". The A100 claims to offer 312 TFLOPS, but in practice this is very hard to achieve because the training is memory-bound and we can't keep the TensorCores that do the matrix multiplies fed. On this A100 40GB PCIe GPU, when we count up the FLOPs we're doing and divide by time, we're roughly at half the theoretical maximum peak FLOPS, which is quite good. On the A100 80GB SXM, with higher memory bandwidth and max thermal design power, this goes up to ~60%. (If you use a GPU that is not an A100, ignore this number, because it is in units of A100 fp16 FLOPS.) We also see that the token throughput we are achieving is about 178K tok/s.

Next, our current loss is 7.577. The lower this is, the better our model is at predicting the next token in the sequence on average. Step 80 is very early in the training here. Because the perplexity is exp(7.577) ~= 2K, our model is as confused about each next token, on average, as if it were guessing at random from 2,000 tokens. The full vocab size is 50,257. By the end of the optimization we'll get to about 3.29, so it's as if we're guessing uniformly at random from exp(3.29) ~= 27 tokens at each time step. Finally, we see that the gradient norm is 1.1461. When this number spikes, the gradient is exploding, which is very bad. To mitigate gradient explosions, as is standard, llm.c uses gradient clipping at 1.0: if the gradient norm exceeds 1.0 (as in this time step), we forcefully scale it down so that its norm is at most 1.0. Later in the optimization, the gradient norm usually "calms down" to lower values.
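To make the perplexity and gradient-clipping arithmetic above concrete, here is a tiny illustrative sketch (plain Python, not llm.c code; the clipping function is a generic global-norm rule, not the actual kernel):

```python
import math

# Perplexity: a cross-entropy loss of L means the model is, on average, about as
# uncertain as if it were choosing uniformly among exp(L) tokens.
print(math.exp(7.577))  # ~1950 "effective" choices early in training
print(math.exp(3.29))   # ~27 "effective" choices by the end (full vocab is 50,257)

# Generic global-norm gradient clipping at 1.0 (illustrative only):
def clip_global_norm(grads, max_norm=1.0):
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads[:] = [g * scale for g in grads]
    return total_norm

g = [0.8, 0.6, 0.5]            # pretend flattened gradient, norm ~1.12
print(clip_global_norm(g), g)  # scaled down so its norm is at most 1.0
```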
Visualization. Finally, you'll want to make pretty charts like the one I posted up above. For that, our program is printing some very rudimentary logs to an improvised log124M/main.log file. I have attached an example Jupyter notebook that parses these files and visualizes them in the style above.

Tokenizer. When you're training up above, you'll see a warning that llm.c couldn't find the GPT-2 tokenizer .bin file. That's totally fine for training, but it means that we can't decode, i.e. we can't convert the integer tokens that we sample into little string pieces, to create text that we can read. Here is how we can generate it:

```bash
# install pytorch nightly
conda install --yes pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia
# install huggingface transformers
pip install transformers
# preprocess the TinyShakespeare dataset (very fast, much faster than FineWeb)
python dev/data/tinyshakespeare.py
# run a little training loop in Python/PyTorch
# it saves a lot of .bin files, including the Tokenizer
python train_gpt2.py
```

The Python script is a parallel implementation to llm.c, used for error checking and unit tests (but it doesn't have full feature parity). In particular, if we run it like above, it will write the file gpt2_tokenizer.bin, which the C code can read and use to output nice text during sampling.

Sampling. The code is currently not really intended for inference, but you can hack it to do inference very inefficiently (without any kv-cache etc.) with something like this:

```bash
make train_gpt2cu USE_CUDNN=1
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -e "log124M/gpt2_124M_00018865.bin" \
    -b 1 -t 1024 \
    -x 1 \
    -l 0.0 \
    -s 1 -g 256
```

The -i -j flags are spurious. The -e flag points at the final model checkpoint of our GPT-2 124M model, which llm.c will initialize the model from. -b 1 says to use only a single batch element (one row of 1024 tokens, in which we sample from left to right). -x 1 says we only want to run for a single step, and -l 0.0 sets the learning rate to zero so we don't actually train the model on this single step. Finally, -s 1 says "sample every step" and -g 256 says to sample 256 tokens.

Now, the above is just unconditional sampling. It's possible to hack the code to do conditional sampling, i.e. sequence completion. E.g. I asked our 124M model to complete the text "The GitHub project llm.c is a", and it continued: "free service to enhance the scholarly infrastructure of the academic community.". I then re-sampled with a different seed and got "The GitHub project llm.c is a collaborative effort that rocks GitHub itself". So, not bad I guess :) I had to directly hack the code by setting gen_tokens[1:10] to be the prompt tokens 464, 21722, 1628, 32660, 76, 13, 66, 318, 257 (from tiktokenizer, ty), then hacked the loop index that samples to start at token position 10... you get the idea. TLDR: conditional generation is not really supported but in principle possible, and possibly coming soon.

Code. 95% of the heavy lifting is in the train_gpt2.cu file. It started as a nice clean 1,000 LOC of C code, but has grown quite a bit and is now closer to 3,500 LOC, with 4 supporting files of file I/O utils, tokenizer, dataloader, and random number generation. Roughly speaking, the first 500 LOC are just basic setup of MPI, NCCL, cuDNN, cuBLAS, etc. The next 1,500 LOC are all the layers of the Transformer, with both their forward and backward implementations in efficient CUDA code. All the CUDA kernel development for these files happens in dev/cuda. So, for example, there is a gelu_forward() and then also a gelu_backward(), and the same for all the other layers. The next 1,000 LOC are the gpt2 model, which just strings together the layers and itself has one big gpt2_forward() and gpt2_backward(). The last 1,000 LOC are int main(), which has the main training loop and all the related bookkeeping and argument parsing, and a lot of tedious code around e.g. resuming training from a previous checkpoint, etc.
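As an illustration of these forward/backward pairs, here is a minimal Python sketch of the GeLU pair, assuming the usual GPT-2 tanh approximation; the actual gelu_forward()/gelu_backward() in llm.c are CUDA kernels operating over whole tensors, so treat this only as the underlying math:

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def gelu_forward(x: float) -> float:
    # GPT-2 style tanh approximation of GeLU
    cube = 0.044715 * x ** 3
    return 0.5 * x * (1.0 + math.tanh(SQRT_2_OVER_PI * (x + cube)))

def gelu_backward(x: float, dout: float) -> float:
    # d(gelu)/dx via the chain rule on the tanh approximation, multiplied by the
    # upstream gradient dout (what a backward pass accumulates)
    inner = SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)
    t = math.tanh(inner)
    sech2 = 1.0 - t * t
    local_grad = 0.5 * (1.0 + t) + 0.5 * x * sech2 * SQRT_2_OVER_PI * (1.0 + 3 * 0.044715 * x ** 2)
    return dout * local_grad

print(gelu_forward(1.0))        # ~0.841
print(gelu_backward(1.0, 1.0))  # ~1.083, the local slope of GeLU at x=1
```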
350M model. Overnight I also reproduced the 350M parameter model. Take a look at the file run350M.sh for the exact launch command. I found that 10B tokens was not enough for the 350M model, so you'll have to download and preprocess FineWeb100B (or try to do multiple epochs on just the 10B above, which might work; I have not checked). I configured it to train for ~30B tokens, so we have the following FLOPs using the 6ND approximation:

* 124M on 10B tokens => 6 * 124e6 * 10e9 = 7.44e18 ~= 7e18 capability model
* 350M on ~30B tokens => 6 * 350e6 * 31.5e9 = 6.615e19 ~= 7e19 capability model (~10X)

On 8X A100 80GB SXM, the 350M model stepped at 820 ms/iter. It trained for 60K steps (instead of ~20K), for a total of ~30B tokens (instead of ~10B tokens). Total training time was 14 hours; at $14/hr the cost is 14 x 14 ~= $200 (10X that of the 124M). However, looking at the plot, it's possible that we could have gotten away with slightly less:

(Figure: chart350M — 350M training curves.)

Coming up. That's it for now! We are moving on to the 740M and then, of course, the actual "GPT-2", the 1558M. If I can find the GPUs... By very rough napkin math, on my single 8X A100 80GB GPU box, the 1558M model would take ~1 week and cost ~$2.5K. This is in acceptable territory, but we'll want to take some time to make the current code better, cleaner, better tested, and to add multi-node training support. And, also very much still on my mind, I want to build the whole thing again, from scratch and piece by piece, coming to you soon^TM.

FAQ:

* Can I sample from it? Kind of, but it's inefficient and a bit weird.
* Can I chat with it? No, this is currently only pretraining, not chat finetuning.
* Can you train multi-node distributed? In principle yes; there is a slurm PR up that got this working for up to 50 nodes. In practice I personally haven't tried it yet.
* Are you bitwise deterministic? No, but we are very close; one more kernel to patch.
* Can you train in fp8? No, we're currently mostly training in bf16, but this is coming soon.
* I have a non-NVIDIA GPU (AMD, Apple Silicon, etc.), can I run llm.c? No, llm.c supports C/CUDA only, but I am very happy to link to any forks under a "notable forks" section, or to accept PRs that would make porting llm.c to other platforms easier.
* I only have a CPU, can I play? You won't be able to reproduce the GPT-2 models, but you can take on fun projects by finetuning the OpenAI GPT-2 models on other data, e.g. TinyShakespeare or TinyStories. Support for these datasets, initialization, and CPU finetuning exists in llm.c in train_gpt2.c. (It's a lot more rudimentary though, intended mostly as a reference for the CUDA code.)
* How does this compare to PyTorch? llm.c is a "straight up" C/CUDA implementation. The PyTorch code at train_gpt2.py does not have full feature parity (e.g. it doesn't do sharded data loading, etc.) and is meant more as a reference, but I think you can get something similar to the 124M model above stepping as follows: `torchrun --standalone --nproc_per_node=4 train_gpt2.py --input_bin dev/data/fineweb10B/fineweb_train_000001.bin --write_tensors 0 --model d12 --batch_size 64 --sequence_length 1024 --total_batch_size 524288 --dtype bfloat16 --compile 1 --tensorcores 1 --flash 1 --num_iterations 18865 --weight_decay 0.1 --overfit_single_batch 0`. I am interested in and would accept PRs that bring the PyTorch training closer to feature parity with the llm.c training loop.
* Why do you care so much about GPT-2? GPT-2 is the grand-daddy of LLMs, the first time that the modern LLM stack came together in a recognizably modern form, and the parameters were released by OpenAI. GPT-3 actually didn't change too much at all about the model (context size 1024 -> 2048, I think that's it?). GPT-4 details were never published. Many other LLMs also strongly resemble GPT-2, despite it being from 2019; e.g. Llama 3, from the architecture perspective, is a non-linearity change in the MLP plus the addition of RoPE relative positional encodings.
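As a rough cross-check of the headline "90 minutes for $20", here is a back-of-envelope sketch using the same 6ND approximation together with the ~60% MFU, 312 TFLOPS per A100, and ~$14/hr figures quoted above (6ND ignores attention FLOPs, so the real wall-clock comes out slightly longer, landing right around the quoted numbers):

```python
# Back-of-envelope for the 124M run: 6*N*D FLOPs, 8x A100 at ~60% MFU, ~$14/hr.
n_params, n_tokens = 124e6, 10e9
total_flops = 6 * n_params * n_tokens        # ~7.44e18 FLOPs
peak_flops = 8 * 312e12                      # 8x A100, dense fp16/bf16
mfu = 0.60
seconds = total_flops / (peak_flops * mfu)   # ~4,970 s
hours = seconds / 3600
print(f"~{hours * 60:.0f} minutes, ~${hours * 14:.0f} at $14/hr")  # ~83 minutes, ~$19
```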
Acknowledgements. Call out to @ngc92 and @ademeure, who have both made substantial contributions to llm.c across the board and especially on CUDA kernel optimization; @chinthysl and @PeterZhizhin for distributed optimization PRs; and @rosslwheeler for Windows support and tooling. Please feel free to use the Discussions for any FAQ and related questions, or, if you'd like something faster, #llmc on Discord, or #llmdotc on the CUDA MODE Discord.

Replies (9 comments, 19 replies):

timlmit (May 28, 2024): boss <3

karpathy (Maintainer, May 28, 2024): A few more pointers:
* I answered some questions in the HN thread
* I answered some questions in the X thread

Niskarsh12 (May 28, 2024): I wanted to try it but sadly I am GPU poor :/
* jamesalmeida (May 28, 2024): Same
* orph (May 28, 2024), quoting the post: "For example, on Lambda this node goes for ~$14/hr, so the total cost of reproducing this model today is about $20."

sebhtml (May 28, 2024, edited): Thank you @karpathy for your valuable teaching lessons in your GitHub repositories. I cloned llm.c to check how you do the dropout. I found some random number generation functions that run on the NVIDIA CUDA GPU devices. Where is the Dropout being performed?
* karpathy (Maintainer, May 28, 2024): There is no Dropout right now. It would probably work a bit better to add weak Dropout, e.g. 0.05, but it introduces a layer of complexity around having a train and an eval mode, and dealing with all of that is too much headache at this point. We're training very small models on sufficiently large datasets, so I think there is less need for regularization. It still probably helps a bit.
* sebhtml (May 28, 2024): OK, thank you Mr @karpathy!
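For readers curious about the train/eval-mode complexity mentioned above, here is a minimal PyTorch sketch of what weak dropout would involve; this module is hypothetical (it is not in llm.c or train_gpt2.py), and the 0.05 rate and placement are assumptions purely for illustration:

```python
import torch
import torch.nn as nn

class MLPWithDropout(nn.Module):
    # Hypothetical: a GPT-2-style MLP block with weak dropout added for illustration.
    def __init__(self, n_embd: int, p_drop: float = 0.05):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)
        self.c_proj = nn.Linear(4 * n_embd, n_embd)
        self.drop = nn.Dropout(p_drop)  # active only in train mode

    def forward(self, x):
        x = self.c_proj(nn.functional.gelu(self.c_fc(x)))
        return self.drop(x)

m = MLPWithDropout(768)
x = torch.randn(2, 1024, 768)
m.train(); y_train = m(x)  # dropout applied (random masking + rescaling)
m.eval();  y_eval = m(x)   # dropout is a no-op at eval time
```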
sebhtml (May 28, 2024, edited): Hello Mr @karpathy, I saw MPI_Allgather in https://github.com/karpathy/llm.c/blob/master/train_gpt2.cu#L425. Why is MPI_Allgather used here if all 8 A100 80GB SXM GPUs are on the same node?
* karpathy (Maintainer, May 28, 2024): We wish to support multinode training imminently.
* sebhtml (May 28, 2024): Oh, this is exciting!
* karpathy (Maintainer, May 28, 2024): There's already a PR up (linked above in the post), but I haven't gotten around to it yet.

YuchenJin (May 28, 2024): There are a few places where train_gpt2cu should be changed to train_gpt2.cu.
(4 earlier replies not shown)
* karpathy (Maintainer, May 28, 2024): Yes, it hard-codes A100 right now. The calculation is very simple, right here: https://github.com/karpathy/llm.c/blob/master/train_gpt2.cu#L2875. Just swap in the H100 value for fp16, without sparsity, 989.5 I think. (I would accept PRs that generalize this, detecting which GPU a person is using and what its peak fp16 flops are.)
* ngc92 (May 28, 2024): We actually already know which device we're running on; that can be found in deviceProp.name. We even print that out once the training starts, so really, we'd just need to strcmp this against a bunch of known device names and get a list of expected TFLOPS.
* YuchenJin (May 28, 2024): @ngc92 that would be awesome!
* ngc92 (May 28, 2024): @YuchenJin could you check what name the H100 gets when we print out the device name in the beginning?
* YuchenJin (May 28, 2024): It prints "NVIDIA H100 80GB HBM3" (full screenshot attached).
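A small sketch of the lookup ngc92 describes — mapping the printed device name to a peak fp16 TFLOPS figure for the MFU calculation. In llm.c this would presumably be a strcmp table in C over deviceProp.name; the Python below only illustrates the idea, using the two values quoted in this thread:

```python
# Sketch of the device-name -> peak TFLOPS lookup suggested above (illustration
# only; the real thing would be a small strcmp table in train_gpt2.cu).
PEAK_FP16_TFLOPS = {
    "A100": 312.0,   # value currently hard-coded in train_gpt2.cu
    "H100": 989.5,   # dense fp16/bf16 (no sparsity), per the thread above
}

def peak_flops_for(device_name: str):
    # e.g. device_name = "NVIDIA H100 80GB HBM3", as printed at startup
    for key, tflops in PEAK_FP16_TFLOPS.items():
        if key in device_name:
            return tflops * 1e12
    return None  # unknown GPU: skip the MFU column

print(peak_flops_for("NVIDIA H100 80GB HBM3"))  # 9.895e+14
```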
banyan-god (May 28, 2024, edited): Here is a model (500M) I have been training for the last few days using the llama2 architecture, hoping to train it to around 200 billion tokens. This is using FineWeb 2024 and 4x 4090: https://wandb.ai/banyan-t/llamac/runs/zjaods8q (n_heads: 36, n_kv_heads: 36, n_layers: 36, dim: 970)

bprimal22 (May 28, 2024): What a fucking legend. I'm starting on this tonight!!

dbmanifest (May 28, 2024): "You can train the model with a single GPU too, it would just take proportionally longer (e.g. ~4-24 hours depending on the GPU)" — sorry to bother, but what's the oldest/cheapest/weakest GPU that will be able to train this within 24 hours?
* fernand (May 28, 2024, edited): 24 hours needs 160+ theoretical BF16 Tensor Core TFLOPS assuming ~60% MFU, so just over one RTX 3090. 50% MFU is pretty likely for most high-memory-bandwidth PCIe GPUs.
* fernand (May 28, 2024): The Titan RTX should likely take <30 hours with FP16 training using loss scaling.
* ngc92 (May 28, 2024): Loss scaling isn't implemented (yet).
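A rough way to arrive at numbers like fernand's, using the 6ND approximation from the post (a sketch only; 6ND slightly undercounts because it ignores attention FLOPs, and real MFU varies by GPU):

```python
# How many theoretical BF16 TFLOPS does a single GPU need to finish the
# 124M / 10B-token run in 24 hours? (6*N*D approximation; MFU values from the thread.)
total_flops = 6 * 124e6 * 10e9          # ~7.44e18
sustained = total_flops / (24 * 3600)   # ~86 TFLOPS actually delivered
for mfu in (0.6, 0.5):
    print(f"MFU {mfu:.0%}: need ~{sustained / mfu / 1e12:.0f} theoretical TFLOPS")
# -> ~144 TFLOPS at 60% MFU, ~172 TFLOPS at 50% MFU, i.e. roughly an
#    RTX 3090-class GPU or better, consistent with fernand's "160+" ballpark.
```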