https://github.com/karpathy/nanoGPT
# nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of minGPT that prioritizes teeth over education. Still under active development, but currently the file train.py reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in 38 hours of training. The code itself is plain and readable: train.py is a ~300-line boilerplate training loop and model.py a ~300-line GPT model definition, which can optionally load the GPT-2 weights from OpenAI. That's it.

(figure: repro124m, the training loss curve for the GPT-2 124M reproduction)

Because the code is so simple, it is very easy to hack to your needs, train new models from scratch, or finetune pretrained checkpoints (e.g. the biggest one currently available as a starting point would be the GPT-2 1.3B model from OpenAI).

## install

Dependencies:

- pytorch <3
- numpy <3
- `pip install datasets` for huggingface datasets <3 (if you want to download + preprocess OpenWebText)
- `pip install tiktoken` for OpenAI's fast BPE code <3
- `pip install wandb` for optional logging <3
- `pip install tqdm`

## usage

To render a dataset we first tokenize some documents into one simple long 1D array of token indices. E.g. for OpenWebText run:

```
$ cd data/openwebtext
$ python prepare.py
```

to download and tokenize the OpenWebText dataset. This will create a train.bin and val.bin which hold the GPT-2 BPE token ids in one long sequence, stored as raw uint16 values.
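To get a feel for this data format, here is a minimal sketch of how such a file of raw uint16 token ids can be memory-mapped and sliced into (input, target) batches. This is only an illustration in the spirit of what train.py does, not a copy of it; the helper name and the batch size are made up for the example:

```python
import numpy as np
import torch

block_size = 1024  # GPT-2's context length
batch_size = 4     # illustrative value

# the token ids were written out as raw uint16, so read them back the same way;
# memory-mapping avoids loading the whole GB-scale file into RAM
data = np.memmap('data/openwebtext/train.bin', dtype=np.uint16, mode='r')

def get_batch():
    # sample random starting offsets, then slice out contiguous blocks of tokens
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # cast to int64 because PyTorch embedding lookups expect long indices
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    # targets are the same blocks shifted one position right (next-token prediction)
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y

x, y = get_batch()
print(x.shape, y.shape)  # torch.Size([4, 1024]) torch.Size([4, 1024])
```

Storing the tokens as one flat uint16 array keeps the dataset compact (2 bytes per token) and makes random access trivial, which is all a language-model training loop really needs.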
Then we're ready to kick off training. The training script currently by default tries to reproduce the smallest GPT-2 released by OpenAI, i.e. the 124M version of GPT-2. We can train as follows on a single device, though I encourage you to read the code and see all of the settings and paths up top in the file:

```
$ python train.py
```

To train using PyTorch Distributed Data Parallel (DDP), run the script with torchrun. For example, to train on a node with 4 GPUs run:

```
$ torchrun --standalone --nproc_per_node=4 train.py
```

Once some checkpoints are written to the output directory (e.g. ./out by default), we can sample from the model:

```
$ python sample.py
```

Training on 1 A100 40GB GPU overnight currently gets loss ~3.74, training on 4 gets ~3.60. Training on an 8 x A100 40GB node for ~500,000 iters (~1 day) atm gets down to ~3.1. Random chance at init is -ln(1/50257) = 10.82. Which brings us to baselines.

## baselines

OpenAI GPT-2 checkpoints allow us to get some baselines in place for OpenWebText. We can get the numbers as follows:

```
$ python train.py eval_gpt2
$ python train.py eval_gpt2_medium
$ python train.py eval_gpt2_large
$ python train.py eval_gpt2_xl
```

and observe the following losses on train and val:

| model | params | train loss | val loss |
| --- | --- | --- | --- |
| gpt2 | 124M | 3.11 | 3.12 |
| gpt2-medium | 350M | 2.85 | 2.84 |
| gpt2-large | 774M | 2.66 | 2.67 |
| gpt2-xl | 1558M | 2.56 | 2.54 |

I briefly tried finetuning gpt2 a bit more on our OWT and didn't notice dramatic improvements, suggesting that OWT is not much different from WT in terms of the data distribution, but this needs a more thorough attempt once the code is in a better place.

## finetuning

For an example of how to finetune a GPT on new text, go to data/shakespeare and look at prepare.py to download the tiny shakespeare dataset and render it into a train.bin and val.bin. Unlike OpenWebText this will run in seconds. Finetuning takes very little time, e.g. on a single GPU just a few minutes. Run an example finetuning like:

```
$ python train.py config/finetune_shakespeare.py
```

This will load the config parameter overrides in config/finetune_shakespeare.py (I didn't tune them much though). Basically, we initialize from a GPT-2 checkpoint with init_from and train as normal, except shorter and with a small learning rate. The best checkpoint (lowest validation loss) will be in the out_dir directory, e.g. in out-shakespeare by default, per the config file. You can then run the code in sample.py to generate infinite Shakespeare. Note that you'll have to edit it to point to the correct out_dir.

## i only have a macbook

It's possible to play with the code if you only have a macbook or some other cheap computer. In this case it's much easier to just work with the Shakespeare dataset. Step 1, render the training data:

```
$ cd data/shakespeare
$ python prepare.py
```

Then launch the training script with a baby network, here is an example:

```
$ cd ../..
$ python train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64 --device=cpu --compile=False --eval_iters=1 --block_size=64 --batch_size=8
```

This creates a much smaller Transformer (4 layers, 4 heads, 64 embedding size), runs only on CPU, does not torch.compile the model (torch seems to give an error if you try), only evaluates for one iteration so you can see the training loop at work immediately, makes the context length much smaller (64 tokens), and reduces the batch size to 8. On my MacBook Air (M1) this takes about 400ms per iteration. The network is still pretty expensive because the current vocabulary is hard-coded to the GPT-2 BPE encodings of vocab_size=50257, so the embeddings table and the last layer are still massive.

You can now also work with tiny shakespeare on the character level: see data/shakespeare_char and run prepare.py to tokenize it on the character level (see the sketch below). If you have a GPU you can use the decent starter settings in a provided config file and train as follows:

```
$ python train.py config/train_shakespeare_char.py
```

But if all you have is a CPU you may want to further override the settings down another notch, e.g.:

```
$ python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=8
```

where we decrease the context length to just 64 characters and only use a batch size of 8.
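Character-level tokenization is about as simple as tokenization gets: every distinct character in the text becomes its own integer id. The repo's data/shakespeare_char/prepare.py does something along these lines; the sketch below is only an illustration (the input file name and the 90/10 split are assumptions, and a real script would also need to save the character vocabulary somewhere so that sampling can decode ids back to text), but it shows the shape of the idea:

```python
import numpy as np

# read the raw text ('input.txt' is an assumed file name; the actual prepare.py
# downloads tiny shakespeare first)
with open('input.txt', 'r') as f:
    text = f.read()

# the vocabulary is just the set of characters that occur in the text
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return ''.join(itos[i] for i in ids)

# 90/10 train/val split (illustrative)
n = len(text)
train_ids = np.array(encode(text[:int(n * 0.9)]), dtype=np.uint16)
val_ids = np.array(encode(text[int(n * 0.9):]), dtype=np.uint16)

# written in the same raw-uint16 style as the OpenWebText train.bin/val.bin,
# so the training loop can memory-map them the same way
train_ids.tofile('train.bin')
val_ids.tofile('val.bin')

print(f"vocab size: {len(chars)}, train tokens: {len(train_ids)}, val tokens: {len(val_ids)}")
```

Because the vocabulary is now just the characters that appear in the file (on the order of 65 for tiny shakespeare) rather than the 50257 GPT-2 BPE tokens, the token embedding table and the final projection layer shrink dramatically, which is exactly why this variant is so much friendlier to a CPU.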
## benchmarking

For model benchmarking bench.py might be useful. It is identical to what happens in the meat of the training loop of train.py, but omits much of the other complexity.

## efficiency notes

The code by default now uses PyTorch 2.0. At the time of writing (Dec 29, 2022) this makes torch.compile() available in the nightly release. The improvement from the one line of code is noticeable, e.g. cutting down iteration time from ~250ms/iter to 135ms/iter. Nice work PyTorch team!

## todos

A few todos I'm aware of:

Optimizations

- Additional optimizations to the running time
- Investigate the need for an actual Data Loader with a dedicated worker process for data
- Look into more efficient fused optimizers (e.g. apex)
- Re-evaluate the use of flash attention (previously I wasn't able to get the forward pass to match up so I took it out)
- CUDA Graphs?
- Investigate potential speedups from Lightning or huggingface Accelerate

Features / APIs

- Add back fp16 support? (would need to also add back the gradient scaler)
- Finetune the finetuning script, I think the hyperparams are not great
- Report and track other metrics, e.g. perplexity, num_tokens, MFU, ...
- Eval zero-shot perplexities on PTB, WikiText, and other related benchmarks

Suspiciousness

- Current initialization (PyTorch default) departs from GPT-2. In a very quick experiment I found it to be superior to the one suggested in the papers, but that can't be right?
- I don't currently seem to need gradient clipping but it is very often used (?). Nothing is exploding so far at these scales but maybe I'm leaving performance on the table. Evaluate with/without.
- I am still not 100% confident that my GPT-2 small reproduction hyperparameters are good, if someone has reproduced GPT-2 I'd be eager to exchange notes ty
- I keep seeing different values cited for weight decay and AdamW betas, look into this
- I can't exactly reproduce Chinchilla paper results, see the scaling_laws.ipynb notebook

Results

- Actually reproduce GPT-2 results and have clean configs that reproduce the result. It was estimated ~3 years ago that the training cost of the 1.5B model was ~$50K (?). Sounds a bit too high.

## acknowledgements

Thank you Lambda labs for supporting the training costs of nanoGPT experiments.