# slowllama

Fine-tune Llama2 and CodeLlama models, including 70B/34B, on Apple M1/M2 devices (for example, MacBook Air or Mac Mini) or consumer NVIDIA GPUs.

slowllama does not use any quantization. Instead, it offloads parts of the model to SSD or main memory on both the forward and backward passes. In contrast with training large models from scratch (unattainable) or inference, where we are likely to care about interactivity, we can still get something finetuned if we let it run for a while.

The current version uses LoRA to limit the updates to a smaller set of parameters. The first version supported full finetuning as well, but I decided to remove it for now; more on that below.

Finetuning is the only focus here; nothing special is done for inference - consider llama.cpp for that. For CUDA-specific experiments, see the report on a10.

## Example

Tests were done on an Apple M1 with 16Gb memory and an Apple M2 with 24Gb memory.

In order to fine-tune a llama2 model we need to:

1. Install dependencies: `pip install torch sentencepiece numpy`. Optional: `pip install fewlines` for weight/gradient distribution logging.
2. Clone llama2 and follow the instructions to download the models. The script will download the tokenizer as well; `tokenizer.model` should be put into the same directory as the llama model itself. Use codellama for CodeLlama models. An example folder structure could look like:

```
/parent/
    /slowllama/...             # <- this repo
    /codellama/...             # <-- this is Meta's codellama repository
    /llama-2-7b/...            # <- put tokenizer.model here
    /llama-2-13b/...           # <- and here
    /llama-2-70b/...           # <- and here as well
    /CodeLlama-34b-Python/...  # and here
```

Let's start with a tiny example. It is an intro to the description of another open-source project - cubestat. The text is short enough to just be included as part of the prompt, but it works as an illustration and you can read it in seconds yourself. As I published that project only recently, there's no way the original llama would know anything about it.
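The commands below pass `mps` as the device. A quick, generic PyTorch check (not part of this repo) to confirm that the backend is available on your machine:

```python
import torch

# Generic PyTorch sanity check, not a slowllama script:
# the examples below pass "mps" as the device argument.
print("mps available:", torch.backends.mps.is_available())
```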
Asking base llama2-7b to complete the prompt "Cubestat reports the following metrics: " results in "1) the number of cubes in the system, 2) the number of cubes that are in the process of being created".

The first step is to transform the model into a sequential format more suitable for loading to/from storage block by block:

```
python prepare_model.py
```

Modify the input/output paths in the script itself. Now we can try the not-yet-finetuned llama2:

```
python test_gen.py ../llama7b mps # use path to transformed model here
```

Now let's finetune the 7b model. finetune.py is a very simple script which trains LoRA weights based on plaintext data. There are some settings you could change here, like sequence length, batch size, learning rate, dropout rate and the number of iterations. The current settings are pretty much a guess; adjust them as needed. Currently it uses the AdamW optimizer.

```
python finetune.py
```

Here's the train dataset loss:

```
2023-09-10 22:05:35,569 backprop done, loss after forward pass = 2.9539270401000977
2023-09-10 22:06:08,022 backprop done, loss after forward pass = 2.9073102474212646
2023-09-10 22:06:40,223 backprop done, loss after forward pass = 2.7192320823669434
2023-09-10 22:07:12,468 backprop done, loss after forward pass = 2.7223477363586426
2023-09-10 22:07:44,626 backprop done, loss after forward pass = 2.5889995098114014
2023-09-10 22:08:16,899 backprop done, loss after forward pass = 2.4459967613220215
2023-09-10 22:08:49,072 backprop done, loss after forward pass = 2.3632657527923584
2023-09-10 22:09:21,335 backprop done, loss after forward pass = 2.250361442565918
2023-09-10 22:09:53,511 backprop done, loss after forward pass = 2.165428638458252
2023-09-10 22:10:25,738 backprop done, loss after forward pass = 2.031874656677246
2023-09-10 22:13:45,794 backprop done, loss after forward pass = 1.8926434516906738
2023-09-10 22:14:18,049 backprop done, loss after forward pass = 1.7222942113876343
2023-09-10 22:14:50,243 backprop done, loss after forward pass = 1.58726966381073
2023-09-10 22:15:22,405 backprop done, loss after forward pass = 1.4983913898468018
2023-09-10 22:15:54,598 backprop done, loss after forward pass = 1.296463131904602
2023-09-10 22:16:26,909 backprop done, loss after forward pass = 1.3328818082809448
2023-09-10 22:16:59,031 backprop done, loss after forward pass = 1.0978631973266602
2023-09-10 22:17:31,200 backprop done, loss after forward pass = 1.018444538116455
2023-09-10 22:18:03,406 backprop done, loss after forward pass = 0.8421685099601746
2023-09-10 22:18:35,673 backprop done, loss after forward pass = 0.7168515920639038
2023-09-10 22:21:55,482 backprop done, loss after forward pass = 0.7870235443115234
```

I didn't add a validation set for this data; instead I just checked what the fine-tuned model would produce for the same prompt.

At around iteration 10 we get the following reasonable output:

```
Cubestat reports the following metrics: 1. CPU usage, 2. Memory usage, 3. Disk usage
```

At around iteration 20 another output is produced:

```
0 - Cubestat reports the following metrics: CPU utilization: Efficiency and Performance cores. Shows as percentage.
```

Maybe we were already overfitting at this point.

Running completion with the newly produced LoRA checkpoint can be done like this:

```
python test_gen.py ../llama-2-7b mps ./data/state_dict_29.pth
```
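As noted above, finetune.py only updates LoRA weights while the base model stays frozen. As a rough illustration of what that means - a hypothetical module with made-up default rank/alpha/dropout values, not the actual code from this repo:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a LoRA-wrapped linear layer; names and defaults are illustrative.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0, dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # frozen base weights are never updated
        # only these two small matrices receive gradients and optimizer updates
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + scaled low-rank trainable correction
        return self.base(x) + (self.dropout(x) @ self.A.T @ self.B.T) * self.scale
```

Because only the small A/B matrices are trainable, the optimizer state and the saved checkpoints stay tiny compared to the full model, which is what makes the offloading scheme described next practical.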
## How does it work?

For all versions the process is roughly the same. First, we need to be able to load a model which requires more RAM than we have and save it back in a sequential format.

We create a model instance with all large modules' weights offloaded to SSD - all of the transformer blocks, the token embeddings and the output linear layer. After that we load the original model shards one by one; for each shard we iterate over all modules, update the corresponding subset of their weights and save them back.

Doing the forward path is easy - we just load modules when we need them and pass the output forward. The backward pass is a little more tricky; in a way, we have to run the forward pass twice. The way it's currently implemented is:

1. Do a forward pass while also saving the inputs to each offloaded block to SSD. The goal of this first forward pass is to compute the final loss and cache the inputs to each offloaded block.
2. Then do a manual backward gradient propagation. We start from the last block and re-run each block once again (forward, to build the autograd graph) with the same input we cached in step (1). After that we run the backward pass within that block only, and pass the gradient with respect to its input on to the next (i.e., previous) block. As we use LoRA, only LoRA gradients are saved; LoRA weights are not offloaded to disk and always stay in RAM / on the GPU. Important: we also need to save and restore the random number generator state before evaluating each offloaded module. During training we use dropout, and the randomly switched-off neurons must be the same on both forward passes.
3. After that we run an optimizer step on the LoRA weights and save them separately if needed.

Original llama2 weights are in bfloat16, but the mps backend doesn't support that type natively, so we do the computation in float32 instead.

The experimental version of slowllama, which can still be found here, was capable of full finetuning and updated all weights in pretty much the same way. I've temporarily removed that feature to preserve the lifespan of SSDs, as frequent write operations wear them out over time. Reading from SSD isn't an issue, but SSDs do have a write limit. The limit is typically high enough for normal usage, but for full finetuning we would have to write ~150Gb per iteration/weight update of the 70B variant, assuming a stateless optimizer and no gradient accumulation. With AdamW we would have to save/update another ~150Gb of optimizer state per iteration. If, for example, we assume 1Pb of writes before an SSD starts having issues, even 100 iterations of finetuning would incur significant cost/risk.

For machines with GPUs and a large amount of RAM we can skip the disk entirely and offload to RAM only. It should be possible to bring full finetuning back for main-memory-only offload. On the other hand, if everything fits into memory, there's no need to do the whole 'evaluate twice' thing; one might just use fairscale instead and only move tensors between GPU and CPU.
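A highly simplified sketch of the two-pass scheme above, with hypothetical helper names (the real implementation also restores RNG state so dropout masks match, keeps LoRA weights resident, and handles device placement):

```python
import torch

# Hypothetical sketch: `load_block(i)` is assumed to read block i's weights
# from SSD/RAM and return it as an nn.Module.

def cached_forward(load_block, n_blocks, x, cache_dir):
    # Pass 1: compute the output (and later the loss) without autograd,
    # caching every offloaded block's input on disk.
    with torch.no_grad():
        for i in range(n_blocks):
            torch.save(x, f"{cache_dir}/input_{i}.pt")
            x = load_block(i)(x)
    return x

def manual_backward(load_block, n_blocks, grad_out, cache_dir):
    # Pass 2: walk the blocks in reverse; re-run each block with autograd
    # enabled on its cached input, backprop inside that block only, and pass
    # the input gradient on to the previous block.
    for i in reversed(range(n_blocks)):
        block = load_block(i)
        x = torch.load(f"{cache_dir}/input_{i}.pt").requires_grad_(True)
        y = block(x)           # rebuild the local autograd graph
        y.backward(grad_out)   # gradients land on this block's (LoRA) parameters
        grad_out = x.grad      # gradient w.r.t. the input, handed to the previous block
```

In this sketch the initial `grad_out` would come from the loss computed on the output of pass 1, and the optimizer step on the LoRA parameters happens once `manual_backward` finishes.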
## Experiments

### Llama2 7B finetune on M1 Mini (16Gb memory)

*(chart: finetune on mac mini)*

Here we can see resource utilization for one full iteration on the 7B model - forward and manual backward passes. Each column == 1 second. A few notes:

1. The GPU is reasonably well utilized.
2. The first forward pass has lower GPU utilization and spends more time on IO, as we need to both read weights and write cached inputs/outputs.
3. The backward (combined) pass achieves very high GPU utilization, close to 100%.
4. As we move along the layers back and forth, right after each 'direction switch' we process layers in LIFO order. Thus at the beginning of both the forward and the backward pass we don't have to access the disk - those weights are still cached and we don't see disk reads.

batch_size/seq_len: it works ok with, say, seq_len = 2048 and batch_size = 2.

### Llama2 70B finetune on M1 Mini (16Gb memory)

*(chart: finetune 70b model)*

The chart here has a different granularity - each column is 30 seconds. The input data was also different - it is the readme file you are reading now.

I didn't have enough free space on disk to store both the original weights (140Gb) and the weights in the sequential format we use (another 140Gb). In order to still be able to finetune this model, I stored the original weights on a much slower external SD card, as we need to read them only once; the weights in sequential format live on the fast internal SSD. With batch size = 16 and sequence length = 128 it was taking ~25-30 min per iteration.

As we can see, GPU utilization doesn't look that great - we might be able to benefit from prefetching the next transformer block, assuming we have enough memory to store two layers. Memory utilization peaked at around 80% of 16Gb.

Loss over time:

```
2023-09-13 17:30:28,731 backprop done, loss after forward pass = 2.431253433227539
2023-09-13 18:00:00,133 backprop done, loss after forward pass = 2.604712963104248
2023-09-13 18:29:36,473 backprop done, loss after forward pass = 2.6277880668640137
2023-09-13 19:00:40,463 backprop done, loss after forward pass = 2.408756971359253
2023-09-13 19:29:55,974 backprop done, loss after forward pass = 2.6121537685394287
2023-09-13 19:59:04,849 backprop done, loss after forward pass = 2.428431987762451
2023-09-13 20:27:03,760 backprop done, loss after forward pass = 2.4040215015411377
2023-09-13 20:55:56,969 backprop done, loss after forward pass = 2.158071279525757
2023-09-13 21:25:04,615 backprop done, loss after forward pass = 2.3459620475769043
2023-09-13 21:54:07,128 backprop done, loss after forward pass = 2.2933709621429443
2023-09-13 23:18:57,588 backprop done, loss after forward pass = 2.273494243621826
2023-09-13 23:48:05,310 backprop done, loss after forward pass = 2.4055371284484863
2023-09-14 00:17:19,113 backprop done, loss after forward pass = 2.2604546546936035
2023-09-14 00:46:31,872 backprop done, loss after forward pass = 2.552386522293091
2023-09-14 01:15:45,731 backprop done, loss after forward pass = 2.297588586807251
2023-09-14 01:44:51,640 backprop done, loss after forward pass = 2.1217401027679443
2023-09-14 02:14:09,033 backprop done, loss after forward pass = 1.9815442562103271
2023-09-14 02:43:09,114 backprop done, loss after forward pass = 2.020181179046631
2023-09-14 03:12:17,966 backprop done, loss after forward pass = 2.0041542053222656
2023-09-14 03:41:20,649 backprop done, loss after forward pass = 1.9396495819091797
2023-09-14 05:06:31,414 backprop done, loss after forward pass = 2.1592249870300293
2023-09-14 05:35:39,080 backprop done, loss after forward pass = 1.976989984512329
2023-09-14 06:04:57,859 backprop done, loss after forward pass = 1.7638890743255615
2023-09-14 06:34:06,953 backprop done, loss after forward pass = 1.9829202890396118
2023-09-14 07:03:18,661 backprop done, loss after forward pass = 1.754631519317627
2023-09-14 07:32:26,179 backprop done, loss after forward pass = 2.027863025665283
2023-09-14 08:01:37,546 backprop done, loss after forward pass = 1.8579339981079102
2023-09-14 08:30:41,689 backprop done, loss after forward pass = 1.7934837341308594
2023-09-14 08:59:55,921 backprop done, loss after forward pass = 1.794022798538208
2023-09-14 09:28:59,690 backprop done, loss after forward pass = 1.750269889831543
2023-09-14 10:56:19,282 backprop done, loss after forward pass = 1.4310824871063232
2023-09-14 11:25:28,462 backprop done, loss after forward pass = 1.6895856857299805
2023-09-14 11:54:39,973 backprop done, loss after forward pass = 1.5074403285980225
2023-09-14 12:23:42,604 backprop done, loss after forward pass = 1.6695624589920044
2023-09-14 12:53:00,535 backprop done, loss after forward pass = 1.4220315217971802
2023-09-14 13:22:15,685 backprop done, loss after forward pass = 1.5720497369766235
2023-09-14 13:51:30,744 backprop done, loss after forward pass = 1.544579267501831
2023-09-14 14:20:44,482 backprop done, loss after forward pass = 1.2813694477081299
2023-09-14 14:50:03,384 backprop done, loss after forward pass = 1.2990479469299316
2023-09-14 15:19:09,620 backprop done, loss after forward pass = 1.0500637292861938
```
We used the prompt 'slowllama is a ', and here you can see the completions:

* before any weight update: slowllama is a 24 year old (DOB: December 25, 1994) pure-blood witch
* after 10 iterations: slowllama is a 24 year old (DOB: December 25, 1994) pure-blood witch
* after 20 iterations: slowllama is a 70B model trained on the same data as llama.70b, but with a different training setup.
* after 30 iterations: slowllama is a 2022 fork of llama2, which is a 2021 fork of llama, which is a 2020 fork
* after 40 iterations: slowllama is a 2-stage finetuning implementation for llama2.

The current setup is probably too slow for 70B model finetuning on an old Mac Mini M1. It would be interesting to try it on more recent hardware (say, M2 Max / M2 Pro), implement prefetch/async save and see how it goes.

## merging LoRA weights back

In order to merge a LoRA checkpoint back into the model in the original format, we can do the following:

```
# confirm that the old model is producing the wrong output
python test_gen.py ../llama-2-7b mps
# ...
# 0 - slowllama is a 24 year old (DOB: May 1, 1997) pure-blood witch

# check what the output would be for the finetuned model by passing the path to the checkpoint
python test_gen.py ../llama-2-7b mps ./data/state_dict_29.pth
# ...
# 0 - slowllama is a 100% static, 100% offline, 100% open source, 100% free,

# now run the merge. we need to pass:
#  - original model path
#  - lora checkpoint path
#  - path for the new (merged) model
#  - optionally, number of model shards (default = 1)
python merge_lora.py ../llama-2-7b ./data/state_dict_29.pth ../llama-2-7b-out

# copy the tokenizer model over:
cp ../llama-2-7b/tokenizer.model ../llama-2-7b-out/

# now run the new model with no extra checkpoint and observe the new output, same as with the combined model:
python test_gen.py ../llama-2-7b-out mps
# ...
# 0 - slowllama is a 100% static, 100% offline, 100% open source, 100% free,
```

Now ../llama-2-7b-out can be used in exactly the same way as the original llama2 for further quantization, inference, etc.
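Conceptually, the merge step folds the trained low-rank update back into each base weight, so the result is a plain checkpoint in the original llama2 format. Roughly (a hedged sketch with hypothetical tensor names, not the actual merge_lora.py code):

```python
import torch

# Hypothetical illustration of folding LoRA factors into a base weight matrix.
# a has shape (rank, in_features), b has shape (out_features, rank),
# w has shape (out_features, in_features) - so b @ a matches w.
def merge_lora_weight(w: torch.Tensor, a: torch.Tensor, b: torch.Tensor,
                      alpha: float, rank: int) -> torch.Tensor:
    # once merged, no LoRA modules are needed at inference time
    return w + (alpha / rank) * (b @ a)
```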
## Project structure

Just a few files with no dependencies other than torch, numpy and sentencepiece for the tokenizer:

1. blackbox_model.py - model definition and manual backprop implementation. It's based on model.py from llama2.c, also MIT licensed.
2. finetune.py - script which does the training.
3. loader.py - manual loading/saving of large llama2 models.
4. utils.py - small utility functions, including saving/loading random generator state for different devices.
5. test_gen.py - greedily completes the prompt. Takes base weights + trained LoRA weights as input. Useful for sanity checks.
6. blackbox.py - module wrapper which offloads the module to disk or main memory.
7. plot_lora.py - logging utility which writes LoRA weight and gradient distributions to a logfile. Requires fewlines; if fewlines is not installed, it does nothing.
8. merge_lora.py - merges original weights + LoRA weights into the original format, which can then be used directly.
9. prepare_model.py - script to transform a sharded model into a sequentially split model.

## TODO:

- [x] merge lora weights with base model weights and export the combined result in original format
- [x] better way to merge - no need to read sequential format
- [x] save params.json
- [ ] masking
- [ ] more generic train routine
- [ ] how to make it work with fp16 on Apple?
- [x] reimplement tokenizer
- [ ] optimizations - prefetch the next layer/input, save asynchronously, etc.
- [ ] gradient accumulation
- [ ] plot something like memory requirement for (batch_size, seq_len)
- [x] check if/how it works on CUDA
- [x] rope - double-check the values in the original checkpoint vs what's being computed
- [x] make lora params (rank, alpha, dropout) easily configurable
- [x] try RAM offload
- [x] AdamW
- [x] logging weight/gradient distribution
- [ ] combined RAM/disk offload - 200Gb of RAM is a rarity
- [ ] tests, cleanup and comments
- [ ] progress tracking for everything
- [x] quantization? at least 16 bit?
- [ ] quantization beyond 16 bit?
- [x] improve model loading time
- [x] configurable weight tying
- [ ] double-check RNG state correctness

## References

* llama2
* llama.cpp
* llama2.c
* cubestat
* LoRA

## Contact

{github handle} @ gmail.com