https://github.com/karpathy/llm.c/discussions/481

karpathy / llm.c · Discussions · General

Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20 #481

karpathy (Maintainer) · May 28, 2024 · 9 comments, 19 replies

Let's reproduce GPT-2 (124M) in llm.c (~4,000 lines of C/CUDA) in 90 minutes for $20. The 124M model is the smallest model in the GPT-2 series released by OpenAI in 2019, and it is actually quite accessible today, even for the GPU poor. With llm.c, which is quite efficient at up to ~60% model flops utilization, reproducing this model on one 8X A100 80GB SXM node takes ~90 minutes. For example, on Lambda this node goes for ~$14/hr, so the total cost of reproducing this model today is about $20. You can train the model with a single GPU too; it would just take proportionally longer (e.g. ~4-24 hours depending on the GPU). In addition, llm.c still has a lot of pending optimizations and people haven't tried to tune the training in the style of cramming, so I'd say we're likely to see significant improvements on this number. So here is the run, training the 12-layer, 12-head, 768-dimension, 124M Transformer on 10 billion tokens of FineWeb:

(Figure: chart124M — left: validation loss on the withheld FineWeb split; right: HellaSwag accuracy.)

The left pane shows that we outperform the checkpoint released by OpenAI on the FineWeb withheld validation dataset.
This is not the ideal metric, because the data distribution of GPT-2 was different (it was trained on the never-released "WebText" dataset) and the statistics of the internet may have been different 5 years ago, so it's not a super fair comparison. Therefore, in addition, on the right we also plot the HellaSwag accuracy, a benchmark commonly used to assess LLM capability that is nice, smooth, and well-behaved. I'd mostly look at HellaSwag, but FineWeb val is a nice confirmation. That said, HellaSwag has no math/code, so it slightly favors our setting (common crawl-like data). One more point of reference: the GPT-3 paper, in Appendix H, cites a HellaSwag accuracy of 33.7 for the GPT-3 Small (124M) model. We get to 29.9 here, which surpasses GPT-2 (124M) at 29.4. Keep in mind that here we trained for 10B tokens, while the GPT-3 models were all trained for 300B tokens.

Now here is the shortest path to reproducing this result yourself. You'll need a GPU. I like and run my work on Lambda Labs (who graciously sponsor llm.c development), though the inventory can be limited at times. Many other providers exist and you can use the Discussion below for tips and tricks around this. Here is the example process for Linux x86 64-bit Ubuntu 22.04 with CUDA 12 (this is somewhere around the current, default "modern" configuration). If you're on a different system, the comments and discussion in the main README file might be helpful.

```bash
# install miniconda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
source ~/.bashrc

# pytorch nightly (optional) https://pytorch.org/get-started/locally/
# conda install --yes pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia

# pip installs so we can tokenize the FineWeb dataset
yes | pip install tqdm tiktoken requests datasets

# install cudnn so we can use FlashAttention and run fast (optional)
# https://developer.nvidia.com/cudnn-downloads
# for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04
wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-ubuntu2204-9.1.1_1.0-1_amd64.deb
sudo dpkg -i cudnn-local-repo-ubuntu2204-9.1.1_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cudnn-cuda-12

# "install" cudnn-frontend to ~/
git clone https://github.com/NVIDIA/cudnn-frontend.git

# install MPI (optional, if you intend to use multiple GPUs)
sudo apt install openmpi-bin openmpi-doc libopenmpi-dev

# tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?)
# writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
# and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb
git clone https://github.com/karpathy/llm.c.git
cd llm.c
python dev/data/fineweb.py --version 10B

# compile llm.c (mixed precision, with cuDNN flash-attention)
# first compilation is ~1 minute, mostly due to cuDNN
make train_gpt2cu USE_CUDNN=1

# train on a single GPU
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1

# if you have multiple GPUs (e.g. 8), simply prepend the mpi command, e.g.:
# mpirun -np 8 ./train_gpt2cu \ ...   (the rest of the args are the same)
```
Args guide. A lot of these hyperparameters follow the GPT-3 paper rather than the GPT-2 paper, because it was a lot more detailed. Args explanation:

* -i, -j are the training and validation split token files, written by fineweb.py.
* -o is the output directory to write logs and checkpoints into.
* -e "d12" asks to initialize a depth-12 GPT-2 model from scratch.
* -b 64 sets the micro-batch size to 64. If you are running out of memory, decrease this value, e.g. try 32, 16, 8, all the way down to 1 potentially.
* -t 1024 sets the maximum sequence length to 1024, as GPT-2 did.
* -d 524288 requests that the total batch size per single update be ~0.5M tokens. The code will take this desired batch size and calculate the needed gradient accumulation "inner loop" steps of the optimization. For example, on 8 GPUs at -b 64 and -t 1024, every micro-step is doing exactly 8 x 64 x 1024 = 524,288 tokens, so there is no need for gradient accumulation. But if we only have 1 GPU, the code will set gradient accumulation to 8 and do an inner loop of 8 iterations to add up to this "total batch size" per step (see the short sketch after this list). While the batch size used to train GPT-2 is unknown, this number of ~0.5M comes from the GPT-3 paper table, for this model size.
* -r 1 sets the recompute setting to 1, so we re-compute the GeLU activations. This slightly increases the runtime, but saves quite a bit of memory, allowing us to increase the batch size and get a net increase in token throughput.
* -z 1 turns on ZeRO-1 (i.e. optimizer state sharding) across multiple GPUs. If you're training with > 1 GPU, this setting is a no-brainer and should basically always be on. On 1 GPU it is a no-op.
* -c 0.1 sets the weight decay to 0.1. Only (2D) weights are decayed, exactly as in GPT-2, and this number comes from the GPT-3 paper.
* -l 0.0006 sets the maximum learning rate, from the GPT-3 paper.
* -q 0.0 says that we will decay the learning rate to 0 over the course of training.
* -u 700 says that we will ramp up the learning rate from 0 to the maximum learning rate over the first 700 iterations, which at a total batch size of 0.5M is 350M tokens, following the GPT-3 paper.
* -n 5000 asks to save model checkpoints every 5000 steps.
* -v 250 asks to evaluate and log the validation loss every 250 steps.
* -s 20000 asks to sample some tokens every 20000 steps. Because the total number of steps will be less than this (see below), this basically turns generation off and we will only sample a single time at the very end.
* -h 1 asks to evaluate the HellaSwag accuracy, something we can compare across papers.
* Because we did not set the maximum number of steps with the -x flag, it defaults to exactly one epoch over the training data, i.e. 10B tokens. Because the total batch size is ~0.5M and the total number of tokens is 10B, there will be a total of ~10B/0.5M = 20K steps.

There's a lot of detail above, but the TLDR is that we're training a 12-layer GPT-2 (124M), from scratch, on 10B tokens of FineWeb, with a maximum sequence length of 1024 tokens. If you are running out of memory, I would first make sure you have -r 1 turned on, and then start decreasing the micro-batch size -b by dividing it by 2 until the run fits. Once it runs, see if you can get away with going back to -r 0 to recover a little bit of speed.
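Here is the sketch referenced above — a rough illustration (not the llm.c code itself) of how -d, -b, -t, and the GPU count determine the number of gradient accumulation steps:

```python
# Rough sketch (not llm.c code): how -d (total desired batch size, in tokens)
# turns into gradient accumulation steps, given micro-batch size -b,
# sequence length -t, and the number of GPUs/processes.
def grad_accum_steps(total_batch_size: int, B: int, T: int, num_gpus: int) -> int:
    tokens_per_micro_step = B * T * num_gpus
    assert total_batch_size % tokens_per_micro_step == 0, "should divide evenly"
    return total_batch_size // tokens_per_micro_step

print(grad_accum_steps(524288, B=64, T=1024, num_gpus=8))  # -> 1, no accumulation needed
print(grad_accum_steps(524288, B=64, T=1024, num_gpus=1))  # -> 8, inner loop of 8 micro-steps
```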
Training. The code will print something like this over time (this example is from a single A100 40GB PCIe GPU, $1.29/hr):

```
step 80/18865 | train loss 7.577051 | norm 1.1461 | lr 6.86e-05 | 2950.68 ms | 49.0% A100 fp16 MFU | 177968 tok/s
step 81/18865 | train loss 7.540626 | norm 1.4001 | lr 6.94e-05 | 2952.59 ms | 49.0% A100 fp16 MFU | 177948 tok/s
step 82/18865 | train loss 7.465753 | norm 1.0613 | lr 7.03e-05 | 2953.98 ms | 48.9% A100 fp16 MFU | 177924 tok/s
step 83/18865 | train loss 7.472681 | norm 1.1553 | lr 7.11e-05 | 2955.67 ms | 48.9% A100 fp16 MFU | 177897 tok/s
```

What is going on? Well, we have 10B training tokens and our batch size is ~0.5M, so we'd expect about 10B/0.5M ~= 20K steps in total. It actually works out to exactly 18,865 because one of the data shards is reserved for validation data and the exact batch size is a nice power of 2 at 524,288. So here we are on step 80/18865, which took 2950.68 ms. MFU is short for "Model Flops Utilization". The A100 claims to offer 312 TFLOPS, but in practice this is very hard to achieve because the training is memory-bound and we can't keep the TensorCores that do the matrix multiplies fed. On this A100 40GB PCIe GPU, when we count up the FLOPs we're doing and divide by time, we're roughly at half the theoretical maximum peak FLOPS, which is quite good. On the A100 80GB SXM, with higher memory bandwidth and max thermal design power, this goes up to ~60%. (If you use a GPU that is not an A100, ignore this number, because it is in units of A100 fp16 FLOPS.) We also see that the token throughput we are achieving is about 178K tok/s.

Next, our current loss is 7.577. The lower this is, the better our model is at predicting the next token in the sequence on average. Step 80 is very early in the training here. Because the perplexity is exp(7.577) ~= 2K, our model is as confused about each next token, on average, as if it were guessing at random from 2,000 tokens. The full vocab size is 50,257. By the end of the optimization we'll get to about 3.29, so it's as if we're guessing uniformly at random from exp(3.29) ~= 27 tokens at each time step. Finally, we see that the gradient norm is 1.1461. When this number spikes, the gradient is exploding, which is very bad. To mitigate gradient explosions, as is standard, llm.c uses gradient clipping at 1.0: if the gradient norm exceeds 1.0 (as in this time step), we forcefully scale it down so that its norm is at most 1.0. Later in the optimization, the gradient norm usually "calms down" to lower values.
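To make the perplexity and gradient-clipping arithmetic above concrete, here is a tiny illustrative sketch (plain Python, not llm.c code; the clipping function is a generic global-norm rule, not the actual kernel):

```python
import math

# Perplexity: a cross-entropy loss of L means the model is, on average, about as
# uncertain as if it were choosing uniformly among exp(L) tokens.
print(math.exp(7.577))  # ~1950 "effective" choices early in training
print(math.exp(3.29))   # ~27 "effective" choices by the end (full vocab is 50,257)

# Generic global-norm gradient clipping at 1.0 (illustrative only):
def clip_global_norm(grads, max_norm=1.0):
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads[:] = [g * scale for g in grads]
    return total_norm

g = [0.8, 0.6, 0.5]            # pretend flattened gradient, norm ~1.12
print(clip_global_norm(g), g)  # scaled down so its norm is at most 1.0
```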
Visualization. Finally, you'll want to make pretty charts like the one I posted up above. For that, our program is printing some very rudimentary logs to an improvised log124M/main.log file. I have attached an example Jupyter notebook that parses these files and visualizes them in the style above.

Tokenizer. When you're training up above, you'll see a warning that llm.c couldn't find the GPT-2 tokenizer .bin file. That's totally fine for training, but it means that we can't decode, i.e. we can't convert the integer tokens that we sample into little string pieces, to create text that we can read. Here is how we can generate it:

```bash
# install pytorch nightly
conda install --yes pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia
# install huggingface transformers
pip install transformers
# preprocess the TinyShakespeare dataset (very fast, much faster than FineWeb)
python dev/data/tinyshakespeare.py
# run a little training loop in Python/PyTorch
# it saves a lot of .bin files, including the Tokenizer
python train_gpt2.py
```

The Python script is a parallel implementation to llm.c, used for error checking and unit tests (but it doesn't have full feature parity). In particular, if we run it like above, it will write the file gpt2_tokenizer.bin, which the C code can read and use to output nice text during sampling.

Sampling. The code is currently not really intended for inference, but you can hack it to do inference very inefficiently (without any kv-cache etc.) with something like this:

```bash
make train_gpt2cu USE_CUDNN=1
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -e "log124M/gpt2_124M_00018865.bin" \
    -b 1 -t 1024 \
    -x 1 \
    -l 0.0 \
    -s 1 -g 256
```

The -i -j flags are spurious. The -e flag points at the final model checkpoint of our GPT-2 124M model, which llm.c will initialize the model from. -b 1 says to use only a single batch element (one row of 1024 tokens, in which we sample from left to right). -x 1 says we only want to run for a single step, and -l 0.0 sets the learning rate to zero so we don't actually train the model on this single step. Finally, -s 1 says "sample every step" and -g 256 says to sample 256 tokens.

Now, the above is just unconditional sampling. It's possible to hack the code to do conditional sampling, i.e. sequence completion. E.g. I asked our 124M model to complete the text "The GitHub project llm.c is a", and it continued: "free service to enhance the scholarly infrastructure of the academic community.". I then re-sampled with a different seed and got "The GitHub project llm.c is a collaborative effort that rocks GitHub itself". So, not bad I guess :) I had to directly hack the code by setting gen_tokens[1:10] to be the prompt tokens 464, 21722, 1628, 32660, 76, 13, 66, 318, 257 (from tiktokenizer, ty), then hacked the loop index that samples to start at token position 10... you get the idea. TLDR: conditional generation is not really supported but in principle possible, and possibly coming soon.

Code. 95% of the heavy lifting is in the train_gpt2.cu file. It started as a nice clean 1,000 LOC of C code, but has grown quite a bit and is now closer to 3,500 LOC, with 4 supporting files of file I/O utils, tokenizer, dataloader, and random number generation. Roughly speaking, the first 500 LOC are just basic setup of MPI, NCCL, cuDNN, cuBLAS, etc. The next 1,500 LOC are all the layers of the Transformer, with both their forward and backward implementations in efficient CUDA code. All the CUDA kernel development for these files happens in dev/cuda. So, for example, there is a gelu_forward() and then also a gelu_backward(), and the same for all the other layers. The next 1,000 LOC are the gpt2 model, which just strings together the layers and itself has one big gpt2_forward() and gpt2_backward(). The last 1,000 LOC are int main(), which has the main training loop and all the related bookkeeping and argument parsing, and a lot of tedious code around e.g. resuming training from a previous checkpoint, etc.
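As an illustration of these forward/backward pairs, here is a minimal Python sketch of the GeLU pair, assuming the usual GPT-2 tanh approximation; the actual gelu_forward()/gelu_backward() in llm.c are CUDA kernels operating over whole tensors, so treat this only as the underlying math:

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def gelu_forward(x: float) -> float:
    # GPT-2 style tanh approximation of GeLU
    cube = 0.044715 * x ** 3
    return 0.5 * x * (1.0 + math.tanh(SQRT_2_OVER_PI * (x + cube)))

def gelu_backward(x: float, dout: float) -> float:
    # d(gelu)/dx via the chain rule on the tanh approximation, multiplied by the
    # upstream gradient dout (what a backward pass accumulates)
    inner = SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)
    t = math.tanh(inner)
    sech2 = 1.0 - t * t
    local_grad = 0.5 * (1.0 + t) + 0.5 * x * sech2 * SQRT_2_OVER_PI * (1.0 + 3 * 0.044715 * x ** 2)
    return dout * local_grad

print(gelu_forward(1.0))        # ~0.841
print(gelu_backward(1.0, 1.0))  # ~1.083, the local slope of GeLU at x=1
```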
350M model. Overnight I also reproduced the 350M parameter model. Take a look at the file run350M.sh for the exact launch command. I found that 10B tokens was not enough for the 350M model, so you'll have to download and preprocess FineWeb100B (or try to do multiple epochs on just the 10B above, which might work; I have not checked). I configured it to train for ~30B tokens, so we have the following FLOPs using the 6ND approximation:

* 124M on 10B tokens => 6 * 124e6 * 10e9 = 7.44e18 ~= 7e18 capability model
* 350M on ~30B tokens => 6 * 350e6 * 31.5e9 = 6.615e19 ~= 7e19 capability model (~10X)

On 8X A100 80GB SXM, the 350M model stepped at 820 ms/iter. It trained for 60K steps (instead of ~20K), for a total of ~30B tokens (instead of ~10B tokens). Total training time was 14 hours; at $14/hr the cost is 14 x 14 ~= $200 (10X that of the 124M). However, looking at the plot, it's possible that we could have gotten away with slightly less:

(Figure: chart350M — 350M training curves.)

Coming up. That's it for now! We are moving on to the 740M and then, of course, the actual "GPT-2", the 1558M. If I can find the GPUs... By very rough napkin math, on my single 8X A100 80GB GPU box, the 1558M model would take ~1 week and cost ~$2.5K. This is in acceptable territory, but we'll want to take some time to make the current code better, cleaner, better tested, and to add multi-node training support. And, also very much still on my mind, I want to build the whole thing again, from scratch and piece by piece, coming to you soon^TM.

FAQ:

* Can I sample from it? Kind of, but it's inefficient and a bit weird.
* Can I chat with it? No, this is currently only pretraining, not chat finetuning.
* Can you train multi-node distributed? In principle yes; there is a slurm PR up that got this working for up to 50 nodes. In practice I personally haven't tried it yet.
* Are you bitwise deterministic? No, but we are very close; one more kernel to patch.
* Can you train in fp8? No, we're currently mostly training in bf16, but this is coming soon.
* I have a non-NVIDIA GPU (AMD, Apple Silicon, etc.), can I run llm.c? No, llm.c supports C/CUDA only, but I am very happy to link to any forks under a "notable forks" section, or to accept PRs that would make porting llm.c to other platforms easier.
* I only have a CPU, can I play? You won't be able to reproduce the GPT-2 models, but you can take on fun projects by finetuning the OpenAI GPT-2 models on other data, e.g. TinyShakespeare or TinyStories. Support for these datasets, initialization, and CPU finetuning exists in llm.c in train_gpt2.c. (It's a lot more rudimentary though, intended mostly as a reference for the CUDA code.)
* How does this compare to PyTorch? llm.c is a "straight up" C/CUDA implementation. The PyTorch code at train_gpt2.py does not have full feature parity (e.g. it doesn't do sharded data loading, etc.) and is meant more as a reference, but I think you can get something similar to the 124M model above stepping as follows: `torchrun --standalone --nproc_per_node=4 train_gpt2.py --input_bin dev/data/fineweb10B/fineweb_train_000001.bin --write_tensors 0 --model d12 --batch_size 64 --sequence_length 1024 --total_batch_size 524288 --dtype bfloat16 --compile 1 --tensorcores 1 --flash 1 --num_iterations 18865 --weight_decay 0.1 --overfit_single_batch 0`. I am interested in and would accept PRs that bring the PyTorch training closer to feature parity with the llm.c training loop.
* Why do you care so much about GPT-2? GPT-2 is the grand-daddy of LLMs, the first time that the modern LLM stack came together in a recognizably modern form, and the parameters were released by OpenAI. GPT-3 actually didn't change too much at all about the model (context size 1024 -> 2048, I think that's it?). GPT-4 details were never published. Many other LLMs also strongly resemble GPT-2, despite it being from 2019; e.g. Llama 3, from the architecture perspective, is a non-linearity change in the MLP plus the addition of RoPE relative positional encodings.
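As a rough cross-check of the headline "90 minutes for $20", here is a back-of-envelope sketch using the same 6ND approximation together with the ~60% MFU, 312 TFLOPS per A100, and ~$14/hr figures quoted above (6ND ignores attention FLOPs, so the real wall-clock comes out slightly longer, landing right around the quoted numbers):

```python
# Back-of-envelope for the 124M run: 6*N*D FLOPs, 8x A100 at ~60% MFU, ~$14/hr.
n_params, n_tokens = 124e6, 10e9
total_flops = 6 * n_params * n_tokens        # ~7.44e18 FLOPs
peak_flops = 8 * 312e12                      # 8x A100, dense fp16/bf16
mfu = 0.60
seconds = total_flops / (peak_flops * mfu)   # ~4,970 s
hours = seconds / 3600
print(f"~{hours * 60:.0f} minutes, ~${hours * 14:.0f} at $14/hr")  # ~83 minutes, ~$19
```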
Acknowledgements. Call out to @ngc92 and @ademeure, who have both made substantial contributions to llm.c across the board and especially on CUDA kernel optimization; @chinthysl and @PeterZhizhin for distributed optimization PRs; and @rosslwheeler for Windows support and tooling. Please feel free to use the Discussions for any FAQ and related questions, or, if you'd like something faster, #llmc on Discord, or #llmdotc on the CUDA MODE Discord.

Replies (9 comments, 19 replies):

timlmit (May 28, 2024): boss <3

karpathy (Maintainer, May 28, 2024): A few more pointers:
* I answered some questions in the HN thread
* I answered some questions in the X thread

Niskarsh12 (May 28, 2024): I wanted to try it but sadly I am GPU poor :/
* jamesalmeida (May 28, 2024): Same
* orph (May 28, 2024), quoting the post: "For example, on Lambda this node goes for ~$14/hr, so the total cost of reproducing this model today is about $20."

sebhtml (May 28, 2024, edited): Thank you @karpathy for your valuable teaching lessons in your GitHub repositories. I cloned llm.c to check how you do the dropout. I found some random number generation functions that run on the NVIDIA CUDA GPU devices. Where is the Dropout being performed?
* karpathy (Maintainer, May 28, 2024): There is no Dropout right now. It would probably work a bit better to add weak Dropout, e.g. 0.05, but it introduces a layer of complexity around having a train and an eval mode, and dealing with all of that is too much headache at this point. We're training very small models on sufficiently large datasets, so I think there is less need for regularization. It still probably helps a bit.
* sebhtml (May 28, 2024): OK, thank you Mr @karpathy!
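For readers curious about the train/eval-mode complexity mentioned above, here is a minimal PyTorch sketch of what weak dropout would involve; this module is hypothetical (it is not in llm.c or train_gpt2.py), and the 0.05 rate and placement are assumptions purely for illustration:

```python
import torch
import torch.nn as nn

class MLPWithDropout(nn.Module):
    # Hypothetical: a GPT-2-style MLP block with weak dropout added for illustration.
    def __init__(self, n_embd: int, p_drop: float = 0.05):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)
        self.c_proj = nn.Linear(4 * n_embd, n_embd)
        self.drop = nn.Dropout(p_drop)  # active only in train mode

    def forward(self, x):
        x = self.c_proj(nn.functional.gelu(self.c_fc(x)))
        return self.drop(x)

m = MLPWithDropout(768)
x = torch.randn(2, 1024, 768)
m.train(); y_train = m(x)  # dropout applied (random masking + rescaling)
m.eval();  y_eval = m(x)   # dropout is a no-op at eval time
```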
sebhtml (May 28, 2024, edited): Hello Mr @karpathy, I saw MPI_Allgather in https://github.com/karpathy/llm.c/blob/master/train_gpt2.cu#L425. Why is MPI_Allgather used here if all 8 A100 80GB SXM GPUs are on the same node?
* karpathy (Maintainer, May 28, 2024): We wish to support multinode training imminently.
* sebhtml (May 28, 2024): Oh, this is exciting!
* karpathy (Maintainer, May 28, 2024): There's already a PR up (linked above in the post), but I haven't gotten around to it yet.

YuchenJin (May 28, 2024): There are a few places where train_gpt2cu should be changed to train_gpt2.cu.
(4 earlier replies not shown)
* karpathy (Maintainer, May 28, 2024): Yes, it hard-codes A100 right now. The calculation is very simple, right here: https://github.com/karpathy/llm.c/blob/master/train_gpt2.cu#L2875. Just swap in the H100 value for fp16, without sparsity, 989.5 I think. (I would accept PRs that generalize this, detecting which GPU a person is using and what its peak fp16 flops are.)
* ngc92 (May 28, 2024): We actually already know which device we're running on; that can be found in deviceProp.name. We even print that out once the training starts, so really, we'd just need to strcmp this against a bunch of known device names and get a list of expected TFLOPS.
* YuchenJin (May 28, 2024): @ngc92 that would be awesome!
* ngc92 (May 28, 2024): @YuchenJin could you check what name the H100 gets when we print out the device name in the beginning?
* YuchenJin (May 28, 2024): It prints "NVIDIA H100 80GB HBM3" (full screenshot attached).
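A small sketch of the lookup ngc92 describes — mapping the printed device name to a peak fp16 TFLOPS figure for the MFU calculation. In llm.c this would presumably be a strcmp table in C over deviceProp.name; the Python below only illustrates the idea, using the two values quoted in this thread:

```python
# Sketch of the device-name -> peak TFLOPS lookup suggested above (illustration
# only; the real thing would be a small strcmp table in train_gpt2.cu).
PEAK_FP16_TFLOPS = {
    "A100": 312.0,   # value currently hard-coded in train_gpt2.cu
    "H100": 989.5,   # dense fp16/bf16 (no sparsity), per the thread above
}

def peak_flops_for(device_name: str):
    # e.g. device_name = "NVIDIA H100 80GB HBM3", as printed at startup
    for key, tflops in PEAK_FP16_TFLOPS.items():
        if key in device_name:
            return tflops * 1e12
    return None  # unknown GPU: skip the MFU column

print(peak_flops_for("NVIDIA H100 80GB HBM3"))  # 9.895e+14
```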
banyan-god (May 28, 2024, edited): Here is a model (500M) I have been training for the last few days using the llama2 architecture, hoping to train it to around 200 billion tokens. This is using FineWeb 2024 and 4x 4090: https://wandb.ai/banyan-t/llamac/runs/zjaods8q (n_heads: 36, n_kv_heads: 36, n_layers: 36, dim: 970)

bprimal22 (May 28, 2024): What a fucking legend. I'm starting on this tonight!!

dbmanifest (May 28, 2024): "You can train the model with a single GPU too, it would just take proportionally longer (e.g. ~4-24 hours depending on the GPU)" — sorry to bother, but what's the oldest/cheapest/weakest GPU that will be able to train this within 24 hours?
* fernand (May 28, 2024, edited): 24 hours needs 160+ theoretical BF16 Tensor Core TFLOPS assuming ~60% MFU, so just over one RTX 3090. 50% MFU is pretty likely for most high-memory-bandwidth PCIe GPUs.
* fernand (May 28, 2024): The Titan RTX should likely take <30 hours with FP16 training using loss scaling.
* ngc92 (May 28, 2024): Loss scaling isn't implemented (yet).
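A rough way to arrive at numbers like fernand's, using the 6ND approximation from the post (a sketch only; 6ND slightly undercounts because it ignores attention FLOPs, and real MFU varies by GPU):

```python
# How many theoretical BF16 TFLOPS does a single GPU need to finish the
# 124M / 10B-token run in 24 hours? (6*N*D approximation; MFU values from the thread.)
total_flops = 6 * 124e6 * 10e9          # ~7.44e18
sustained = total_flops / (24 * 3600)   # ~86 TFLOPS actually delivered
for mfu in (0.6, 0.5):
    print(f"MFU {mfu:.0%}: need ~{sustained / mfu / 1e12:.0f} theoretical TFLOPS")
# -> ~144 TFLOPS at 60% MFU, ~172 TFLOPS at 50% MFU, i.e. roughly an
#    RTX 3090-class GPU or better, consistent with fernand's "160+" ballpark.
```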