https://old.reddit.com/r/LocalLLaMA/comments/188197j/80_faster_50_less_memory_0_accuracy_loss_llama/

80% faster, 50% less memory, 0% accuracy loss Llama finetuning
Tutorial | Guide (self.LocalLLaMA)
submitted 01 Dec 2023 by danielhanchen

Hey r/LocalLLaMA community!
Just launched our open-source 5x faster finetuning package Unsloth (https://github.com/unslothai/unsloth), where you can finetune Llama models:

* 5x faster
* with 50% less memory
* with 0% loss in accuracy
* all locally on NVIDIA GPUs (Tesla T4, RTX 20/30/40, A100, H100) for free!
* QLoRA / LoRA is now 80% faster to train.

We manually derived the backpropagation steps, wrote all kernels in OpenAI's Triton language, and applied some more maths and coding trickery. You can read more about our tricks at https://unsloth.ai/introducing.

I wrote a Google Colab notebook for a T4 that finetunes Alpaca 2x faster on a single GPU: https://colab.research.google.com/drive/1oW55fBmwzCOrBVX66RcpptL3a99qWBxb?usp=sharing

On Kaggle, via 2 Tesla T4s with DDP (https://www.kaggle.com/danielhanchen/unsloth-laion-chip2-kaggle), you can finetune LAION's OIG 5x faster and Slim Orca 5x faster - Slim Orca goes from 1301 hours down to 260 hours.

You can install Unsloth locally via:

    pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"

Currently we only support PyTorch 2.1 and Linux distros - more installation instructions are at https://github.com/unslothai/unsloth/blob/main/README.md

We hope to:
1. Support LLMs other than Llama-style models.
2. Add sqrt gradient checkpointing to shave another 25% off memory usage.
3. And other tricks!
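To make the workflow concrete, here is a minimal QLoRA finetuning sketch in the spirit of the linked Colab notebook. The class names (FastLanguageModel, get_peft_model), the checkpoint name and the dataset/column choices below are assumptions based on the repo's README conventions rather than verbatim Colab code, so defer to the notebook and README for exact usage.

    # Hypothetical sketch of an Unsloth QLoRA finetune with TRL's SFTTrainer.
    # Names marked "assumption" are illustrative; check the linked Colab/README.
    from unsloth import FastLanguageModel          # assumption: current README class name
    from trl import SFTTrainer
    from transformers import TrainingArguments
    from datasets import load_dataset

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-2-7b",           # assumption: any Llama-architecture checkpoint
        max_seq_length=2048,
        load_in_4bit=True,                         # QLoRA: 4-bit base weights
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,                                      # LoRA rank
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=16,
    )

    # Assumes the dataset has been mapped to a single "text" column
    # (see the Alpaca formatting sketch further down the thread).
    dataset = load_dataset("yahma/alpaca-cleaned", split="train")

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        args=TrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            max_steps=60,
            learning_rate=2e-4,
            fp16=True,
            output_dir="outputs",
        ),
    )
    trainer.train()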
all 185 comments

[-] 21022018: How does this compare to QLoRA or LoRA?

[-] danielhanchen [S]: Oh, it makes QLoRA 80% faster! So if you already use QLoRA, this makes it faster. I also support LoRA, which is also sped up - a bit less of a speedup though. I edited the post to mention it :)

[-] mcmoose1900: Does it reduce VRAM usage much? Also, either way, this is super cool and awesome. It's insane that everyone is training Llama in eager mode. I'm looking forward to the planned DPO training as well.

[-] Kindly-Abroad-3781: Thank you so much for this awesome open-source work! From what I gather on your blog, all the improvements are a result of manual autograd and switching all the kernels to OpenAI's Triton kernels, right?

[-] danielhanchen [S]: https://unsloth.ai/introducing has more details on the manual autograd methods and the Triton kernels, plus other coding tricks like in-place operations, reduced memory movements, etc.

[-] Kindly-Abroad-3781: Awesome, looking forward to the new blog!

[-] danielhanchen [S]: :)

[-] danielhanchen [S]: I might write a full blog about all the changes we made if you're interested.

[-] Kindly-Abroad-3781: I just had a quick look at the source code of Unsloth, and surprisingly, even though the open version already implements acceleration strategies like Flash Attention, the Max and Pro versions of Unsloth can boost training speed by more than 5 times. If possible, I'm really looking forward to learning about the strategies used in the Max/Pro versions.

[-] danielhanchen [S]: Oh ye, you can boost it even further with more maths and coding hacks!

[-] Relevant_Outcome_726: Can we use this for fine-tuning Mistral?

[-] danielhanchen [S]: Currently no - I will push some changes to allow it in a few days. Technically Mistral's model arch is the same as Llama, so it should be an easy change - I'll msg you once it's done.

[-] OnY86: Nice to hear! Message me too please, thanks!

[-] danielhanchen [S]: Cool!

[-] BayesMind: Mistral sounds great, thank you! I'll sub to your repo for updates!

[-] danielhanchen [S]: :)

[-] UserMinusOne: Nice to hear! Message me too please, thanks!

[-] danielhanchen [S]: Yep!

[-] Paulonemillionand3: I'm also looking for exactly that!

[-] danielhanchen [S]: :)

[-] DickMasterGeneral: Me too, if you don't mind.

[-] danielhanchen [S]: :)

[-] Tiny_Arugula_5648: Me too!

[-] danielhanchen [S]: Cool!!
[-] Square-Tooth2635: !RemindMe 7days

[-] silenceimpaired: Never trained... wish you had a "so you've never trained" guide :)

[-] danielhanchen [S]: Oh, so like a full step-by-step guide on training on a dataset - even the dataset prep stage etc.?

[-] silenceimpaired: Yup. I know the dataset could just be a plain text file... but people see JSON all the time and aren't sure what to make of that, or how to get started. A simple walkthrough encourages people to explore the scary alien terrain :)

[-] danielhanchen [S]: Oh interesting - I'll write up an example - I'll ping you once it's done!

[-] thewayupisdown: Me too, please and thank you!

[-] potatodioxide: Me too please!

[-] pmp22: Ping me too please! Also, a small table with model sizes and hardware requirements would be nice, to get a ballpark for what hardware is needed. Say I have a 4090 - what can I fine-tune with that and how long will it take?

[-] letchhausen: Me, too, please!

[-] Koliham: A guide to training with example datasets, so that we don't make mistakes in the instruct format, would be great.

[-] danielhanchen [S]: I have some Colab notebooks - https://colab.research.google.com/drive/1oW55fBmwzCOrBVX66RcpptL3a99qWBxb?usp=sharing for Alpaca. I can make more for other datasets if that works - do you have any suggestions?

[-] psdwizzard: I would really like to see that too.

[-] danielhanchen [S]: Coolies!

[-] jwyer: That would be great and would help out a lot of people, making LLMs even more accessible.

[-] danielhanchen [S]: I'll make one!!
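For the instruct-format question above: Alpaca-style data keeps three columns (instruction, input, output), and the finetuning notebooks simply render them into one prompt string before tokenisation. A small sketch - the template text follows the usual Stanford Alpaca wording, and collapsing everything into a single "text" column is an assumption that matches what SFT trainers typically expect:

    # Sketch of Alpaca-style instruct formatting (standard Alpaca column names assumed).
    ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

    ### Instruction:
    {instruction}

    ### Input:
    {input}

    ### Response:
    {output}"""

    def format_example(example: dict) -> dict:
        # Collapse the three Alpaca columns into a single "text" field for SFT.
        return {"text": ALPACA_PROMPT.format(**example)}

    # dataset = dataset.map(format_example)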
[-] azriel777: We really need a "train your first AI model for dummies" book.

[-] Aaaaaaaaaeeeee: Currently, I can finetune a 34B on 24GB at a maximum of 192 ctx at rank 8 with the Hugging Face model at 4-bit. I have a feeling the HF 4-bit model is too large - is this able to shrink that size, or just the excess after model loading? And if I used a smaller-bpw GPTQ model, could I still use the library?

[-] danielhanchen [S]: Haven't tried it on 34B yet, but it should also reduce memory usage by 50%, i.e. your batches can be approximately 6x larger according to our matrix size calculations. But essentially we still load the model as 4-bit, then do all the memory shrinking during the training process.

[-] Aaaaaaaaaeeeee: You legend. Thanks for sharing these optimizations!

[-] danielhanchen [S]: Thanks!

[-] FullOf_Bad_Ideas: I am fine-tuning Yi-34B on a 24GB 3090 Ti with ctx size 1200 using axolotl. If you want some tips and tricks with it, I can help you get up to what I am getting. I haven't tried Unsloth yet but I am a touch sceptical.

[-] danielhanchen [S]: Oh, I'm not sure if Yi is supported - I heard it's just Llama's arch, so I'll make it work. Axolotl is cool though!

[-] FullOf_Bad_Ideas: I am fine-tuning on the llama-fied Yi-34B: https://huggingface.co/chargoddard/Yi-34B-Llama/tree/llama-tokenizer It's the same structure as Llama, so unless someone hardcoded parameters like the number of heads, layers, hidden sizes and all of those magic numbers, software that supports Llama 1 33B should also support Yi-34B-Llama without any patches.

[-] danielhanchen [S]: Oh wait, it's also "LlamaForCausalLM" - it should work then. I just haven't fully verified whether grouped-query attention works as expected - hopefully my handling of it works.

[-] FullOf_Bad_Ideas: If it's possible to use Unsloth to train a 34B model with QLoRA at a context length of 4096 on a 24GB GPU, it would be a big deal.

[-] danielhanchen [S]: Probably? I haven't tried it out lol. I'll probably run it on an A100 instance via Colab and see the peak memory usage. I think 4096 is fine, since at 2048 for 7B the max batch size I found to work was around 14!!

[-] Aaaaaaaaaeeeee: I would appreciate it! You could share a config, or maybe make a post with tips for other lone 3090s to replicate. I don't have my setup fully optimized because I still use 0.5-0.6GB for my monitor.
[-] FullOf_Bad_Ideas: Config is here: https://huggingface.co/adamo1139/Yi-34B-Spicyboros-2-2-run3-QLoRA/tree/main/config The secret sauce is to enable Flash Attention and disable sample packing. Something like 1400-1700 ctx should be achievable if you run the PC without a monitor or use the iGPU. I saved 10 bucks buying an Intel CPU with the iGPU fused off and it's biting me in the ass now.

[-] EntertainmentBroad43: How's the memory consumption compared to QLoRA?

[-] danielhanchen [S]: Apologies, I didn't mention it - the "80% faster" is making QLoRA / LoRA itself 80% faster and use 50% less memory. So on the Open Assistant dataset, memory usage via QLoRA is shaved from 14GB to 7.8GB at bsz = 2, ga = 4. You can now fit even larger batches via QLoRA.

[-] ExtensionCricket6501: Interesting - any estimates for the minimum VRAM requirement to train the Llama variants now (7B, 13B, 34B, 70B)? Seems like VRAM already drops a lot on Open Assistant alone.

[-] danielhanchen [S]: Oh yes, it depends on the dataset - for example Alpaca takes 6.8GB of VRAM at batch size = 2. If you do bsz = 1 it'll be even less - I haven't tested it yet. On OASST, VRAM is reduced from 14GB to 7.8GB. For 13B I don't have the numbers, but also a 50% reduction. On 34B and 70B I sadly haven't tested yet - will do so - but presumably again a 50% reduction.

[-] g3t0nmyl3v3l: Dude, that is insane. Amazing work, you rock!

[-] danielhanchen [S]: Thanks a bunch!!!

[-] CjqM8012: Nice work! I am yet to go through the blog, but could any of these optimizations be applied to inference as well?

[-] danielhanchen [S]: Thanks! Yep - working on inference now!!

[-] Tasty-Lobster-8915: I would like to try this! Can you give an example of a full tune script?

[-] danielhanchen [S]: Thanks! We have complete examples via Google Colab: https://colab.research.google.com/drive/1oW55fBmwzCOrBVX66RcpptL3a99qWBxb?usp=sharing for Alpaca, and LAION's OIG via Kaggle on 2 GPUs: https://www.kaggle.com/danielhanchen/unsloth-laion-chip2-kaggle Both are free to run!
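As a rough sanity check on the VRAM figures quoted a few comments up, here is some napkin maths; every number below is an assumed round figure (activation and overhead sizes depend heavily on batch size, sequence length and checkpointing), not Unsloth's internal accounting:

    # Napkin maths for a 7B QLoRA run; all component sizes are rough assumptions.
    params = 7e9
    weights_4bit = params * 0.5 / 1e9      # ~3.5 GB: NF4 base weights at ~4 bits/param
    quant_overhead = 1.0                   # assumed: quant scales, fp16 norms/embeddings
    lora_states = 0.3                      # assumed: r=16 adapters + grads + Adam states
    activations = 2.5                      # assumed: activations at bsz=2, seq=2048
    total = weights_4bit + quant_overhead + lora_states + activations
    print(f"~{total:.1f} GB")              # ~7.3 GB, in the ballpark of the 6.8-7.8 GB quoted above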
[-] Tasty-Lobster-8915: Thanks for those. In both of the links you sent, I see the LoRA rank and targets are set during initialisation. Do you have an example of how to run a full finetune of all parameters (non-LoRA)?

[-] danielhanchen [S]: Ohhh, a full finetune - currently that's sadly not supported - only QLoRA for now, sorry.

[-] Tasty-Lobster-8915: Ahh... any plans for support in the future?

[-] danielhanchen [S]: Technically yes, but sadly, since my bro and I are fully bootstrapping this as a startup, we decided to push it into our Pro and Max plans - we're still not sure how to monetize it yet - as a platform? Sell the code? Etc.

[-] OVAWARE: Well, you should start with donations - it's not much, but it's easy to set up and can help you get started. Then maybe you could sell an API service for training?

[-] danielhanchen [S]: Ye, good point!! I'll ask my bro about this - thanks so much for the help!

[-] Tasty-Lobster-8915: I'm still potentially interested depending on your price point! Looking forward to when your "Pro" and "Max" versions release!

[-] danielhanchen [S]: :)! Having discussions on pricing and stuff - just not sure how we're gonna approach it - if you have any pricing ranges you feel would be right, that'd be sick!

[-] Crafty-Run-6559: Sell easy QLoRAs for $ per hour. Make it simple: upload your training data (better yet, provide a bunch of different sets to use), tune your settings/hyperparameters, and wait for an emailed link to your QLoRA. People will pay for that, and it's recurring revenue. If you release the core training code like you have, it makes it easy for people to trust it. Just start releasing it under the same license as Mongo, or AGPL.

[-] danielhanchen [S]: Ye, a finetuning platform! One issue I'm still figuring out is somehow integrating GPUs via AWS / Google Cloud - I was trying to, say, hook up Colab internally to run it, since we found Colab to be the cheapest.

[-] Crafty-Run-6559: Could always start off with some used 4090s or 3090s lol. It's background batch processing with relatively low bandwidth requirements.

[-] danielhanchen [S]: Yeee, I thought about that - it's not a bad point, I guess - thanks for the ideas - I'll chat with my bro more about this! Appreciate it!

[-] SmolGnoll: You will get hired on the back of this. Advertise your contacts, publish a paper. Also, I am very interested in whether these optimisations can be applied to full fine-tunes.
[-] bot-333: What's the reason that this is faster? Custom kernels?

[-] danielhanchen [S]: Custom kernels in Triton, Flash Attention, in-place ops, manual derivation of matrix differentials, chained matrix bracketing, reduced data movement and more!!! https://unsloth.ai/introducing has more details :) I'll write up a full blog post if you're interested!

[-] bot-333: That would be appreciated! I wonder if they could integrate these into BnB - that could be very fast LOL. I guess there's ExLlamaV2.

[-] danielhanchen [S]: Oh ye, that would be cool! I'll talk with Tim Dettmers from BnB about it!

[-] bot-333: Or maybe integrate it into Transformers itself and/or PEFT/Trainer? That would be huge.

[-] danielhanchen [S]: Ye, good point - I'll see what I can do with my bro! :)

[-] bot-333: Also, can you share more information on Unsloth Pro and Max?

[-] danielhanchen [S]: Ye, so Pro makes training even faster - from 5x to roughly 28x - and supports multi-GPU training. Max further speeds it up to 31x, but the difference is that Max makes it possible to work on Intel and AMD GPUs, and supports full finetuning and training.

[-] bot-333: That sounds nice - can you provide detail on the further optimizations? Or is that a secret sauce?

[-] danielhanchen [S]: So our blog https://unsloth.ai/introducing has a bit more - but for the Pro and Max versions, that's our specialty! :) If you're interested, I'll write a detailed blog post about all the changes we made in the open-source version.

[-] bot-333: Sorry for the multiple comments like this, but maybe CUDA kernels would be faster?

[-] danielhanchen [S]: I found CUDA kernels to be faster for non-jitted code - i.e. if you run kernels only once or twice, since there's a JIT compilation cost with Triton. In general, CUDA and Triton are equal in terms of speed - Triton more so, since you can try out more hypotheses.

[-] bot-333: Interesting, thanks.

[-] danielhanchen [S]: :)

[-] Kgcdc: Will this work on my SMC box with 10 L40S?
Happy to give you access to test if needed.

[-] danielhanchen [S]: Hey! I was just about to test it via Google Cloud's L40 instances! So via DDP (Deepspeed is still in the works), our other offerings, Pro and Max, support it. I'm bootstrapping this as a startup with my brother, so sadly we decided to make it a paid component to cover our living expenses. Can chat if you're interested!

[-] Kgcdc: I have L40S, not L40. But let's chat, since we are looking for the right inference server.

[-] Techyogi: Any chance of Apple silicon support??

[-] danielhanchen [S]: Currently no, sadly - I don't know how to write Apple kernels. But technically, because everything is written in Triton, it should work for AMD and Intel GPUs as well. On CPUs - maybe in the future via BLAS and C++ code, if people are interested.

[-] CanIstealYourDog: I'm fine-tuning Llama 2 7B using QLoRA on an Nvidia A6000. Would this work for that?

[-] Aaaaaaaaaeeeee: From the post: "All locally on NVIDIA GPUs (Tesla T4, RTX 20/30/40, A100, H100s)" - Ampere is supported.

[-] danielhanchen [S]: Thanks! Yep, Ampere, Hopper, etc.! Oops, maybe I should have written that.

[-] danielhanchen [S]: Yep!

[-] ambientswan: Thank you for your work! Any chance of this supporting Apple Silicon / Metal?

[-] danielhanchen [S]: Thanks! Yep, maybe in the future!

[-] iLaurens: The pricing page for Unsloth Pro has this header: "Unlock our 30x faster algorithm for multiple GPUs". But then in the bullets below it says "single GPU only". So what's the deal with Pro? Is it single- or multi-GPU training?

[-] danielhanchen [S]: OHH ye, we're still figuring it out as we go along - much apologies. After discussions with people and my bro, Pro will in fact support multi-GPU, most likely priced like a video game for hobbyists. The issue is we didn't expect the Pro/Max to get interest - our goal was to first showcase the OSS one, so we didn't really plan for Pro/Max yet. I'll update the details once it's all confirmed.

[-] stormer0: The talent posting here is pretty insane. Blows my mind how quickly people are iterating on this. Thank god for open source.

[-] tgredditfc: Awesome!
I really need to reduce VRAM usage, as I need to train with a cutoff length of 2048, which costs tremendous VRAM! Can I run it in WSL?

[-] danielhanchen [S]: It would be fabulous if you could report back on whether it works - I can also help debug the installation if that helps.

[-] tgredditfc: Will do!

[-] danielhanchen [S]: WSL should work, hopefully? I'm not 100% sure - I haven't tried it - but hopefully it works.

[-] wishtrepreneur: When do you have Mistral finetuning planned?

[-] danielhanchen [S]: In the next few days - I'll ping you!

[-] No-Link-2778: What about DeepSpeed ZeRO offload?

[-] danielhanchen [S]: I haven't tested Deepspeed yet - will do in the next few days - but DDP works great on the Pro / Max code paths. The open-source version will sadly segfault on multiple GPUs, since the code mechanisms are different - you still get a 5x speed boost though, with all our tricks!

[-] Calandiel: Could you consider adding axolotl to the comparison graph?

[-] danielhanchen [S]: Will do!

[-] iamMess: And share the config used.

[-] danielhanchen [S]: Yep!

[-] TheEasternContrarian: Love not just the package but the comprehensive, well-documented examples already! I have a more individual question, if you don't mind: what suggestion would you give someone who's getting started with writing custom kernels (CUDA or Triton)?

[-] danielhanchen [S]: Thanks! Oh, Triton has some cool docs / tutorials which I used extensively for Unsloth - https://triton-lang.org/main/getting-started/tutorials/index.html - also our kernels at https://github.com/unslothai/unsloth/tree/main/unsloth/kernels have tonnes of comments, and I tried my best to make them super readable.

[-] TheEasternContrarian: Thank you. The kernel comments are quite clear and intuitive! It looks like, to get started, I would really have to know the maths transformations and the process, and then using the DSL is just a matter of reading the docs and moving the blocks?
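For anyone following the custom-kernel exchange above: the Triton tutorials linked there start from roughly the following vector-add kernel (this is the canonical tutorial example, not Unsloth's code). Unsloth's kernels follow the same shape - a @triton.jit function launched over a grid of blocks - just with the RoPE / RMSNorm / cross-entropy maths inside.

    # Minimal Triton kernel in the style of the official tutorials: elementwise add.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                  # which block this program instance handles
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                  # guard the ragged last block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = (triton.cdiv(n, 1024),)               # one program per 1024-element block
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out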
[-] Tough-Sound-6985: Would inference speed also improve with the new kernels?

[-] danielhanchen [S]: Yes, buttt some kernels don't work yet, since it's optimized for training only - and inference has even more tricks you can use!! I'll see if I can push changes in the coming days!

[-] Danny_Davitoe: Will this work on CPU-only machines?

[-] danielhanchen [S]: I'm working on making CPU training work as well! But currently it's GPU only.

[-] reallmconnoisseur: Look who became aware of your work :)

[-] VectorD: How come you use max_seq_length = 2048 instead of 4096 in the Colab notebook?

[-] danielhanchen [S]: Oh, you can change it to 4096 - up to you - I just chose 2048. I think bsz = 2 still works. The savings are still as described - maybe even more at larger sequence lengths.

[-] VectorD: I see, thanks. I see that multi-GPU is paygated; I am a hobbyist with a 4x 4090 rig for ML. How much do you charge for Unsloth Pro/Max?

[-] danielhanchen [S]: We're working on the pricing plan as we speak!! Sorry, everything is very hectic, so it's still on the drawing board.

[-] iCTMSBICFYBitch: This is incredible. Well done and thank you!

[-] danielhanchen [S]: Thanks!

[-] bymihaj: https://unsloth.ai/introducing mentions AMD GPUs. What is the status? Will inference be available?

[-] danielhanchen [S]: Ye, AMD and Intel via Triton - we Tritonized all kernels, so in theory it should work - even the bitsandbytes 4-bit step is in Triton. I still need to verify whether the Flash Attention kernels via Triton work or not.

[-] FullOf_Bad_Ideas: I never trained with Hugging Face, so that comparison is not very clear to me. Is it faster than QLoRA with axolotl, Flash Attention 2 enabled and sample_packing disabled? If you claim to use 50% less memory than QLoRA, that would mean training a model such as NF4 Llama 2 7B would use about 4GB of GPU memory, which is almost less than the quantized weights of the model itself. Is that the case? Call me sceptical, but you have to be when someone is promoting their paid product.

[-] danielhanchen [S]: Yep, it's still faster than axolotl with FA2 = True and packing = False - I'll provide some benchmarks later - the performance benefit is smaller, though, since FA2 already shaves a chunk off the running time.
Oh no - so 7B will use 7.8GB of VRAM on OASST: the weights take 4.8GB or so, whilst LoRA and the gradients take 3GB. Other datasets are closer to the 50% reduction in training memory usage. Apologies if it seemed like I was promoting a paid product - technically we don't even have a price, as we're very new to this. The issue is that in the past I also released some faster training methods, and they were eaten up by big corpos; we wanted to give the most to the OSS community, hence the gating of some aspects of the code. We're still figuring out our pricing plans.

[-] LJRE_auteur: Christmas has lasted the entire year for AI enthusiasts x). I can't wait for this to be implemented for Windows and/or UIs for LLMs.

[-] danielhanchen [S]: Working on it! Windows - we're trying to see if it can somehow be supported!

[-] deck4242: Good stuff!

[-] danielhanchen [S]: Thanks!

[-] Paulonemillionand3: Fantastic work. I was previously able to use llama-recipes to tune 13B, but recent updates cause it to run out of memory. Hopefully this allows that (2x 3090).

[-] danielhanchen [S]: Tell me how it goes!!!

[-] topiga: Is it possible to convert the result to GGUF afterwards? Also, do you have any examples for Mistral?

[-] danielhanchen [S]: GGML maybe in the future :) Mistral today / tomorrow!!!

[-] topiga: Nice! Thanks

[-] CasimirsBlake: For those of us who would just like to try a model that's been put through this fine-tuning, it'd be nice if folks could upload some to Hugging Face... Any chance of GGUF models? P40s would benefit so much from these improvements. Or does this not make inference any faster yet?

[-] danielhanchen [S]: Currently it works for training - inference is in the works! GGML - I'll see if we can support it!

[-] evilnebster: Does it work with P40s, then? Above, you only mentioned Nvidia Turing and later.

[-] Timotheeee1: Have you also tried the Sophia optimizer?

[-] danielhanchen [S]: No, I haven't yet - but I will try to! I think I read about it on the Machine Learning subreddit or somewhere - I'll report back!

[-] hprnvx: Will it work with a 1060 6GB?
[-] danielhanchen [S]: Oh my, probably not - the lowest we support is probably CUDA compute capability 7.5. Sadly, compute capability 6 is just off the mark.

[-] hprnvx: Ok, thx anyway :)

[-] cyryscyn: Awesome

[-] danielhanchen [S]: Thanks!! Hope you can try it out!! :)

[-] BoneDaddyMan: The sample on GitHub says the context is 2048. Can it finetune with 4096 context? Is this for Llama 2?

[-] danielhanchen [S]: You can change it to whatever you like! :) Yep, Llama 2.

[-] Woof9000: ngl, this is a very sexy post

[-] danielhanchen [S]: :)

[-] bash99Ben: Will it support the V100 32G GPU?

[-] danielhanchen [S]: It does already! :)))

[-] You_Wen_AzzHu: Holy shit, this is groundbreaking.

[-] danielhanchen [S]: :)

[-] wind_dude: Wow, the stats sound impressive - I'll have to try this on my next training run!

[-] danielhanchen [S]: Thanks!

[-] nntb: I can't wait until people start talking about Snapdragon support, like the Snapdragon 8, which actually has tensor cores in its AI elements, allowing phones with it to start doing local AI. There's already one project I know of that lets you do it, but it would be great to see other people get on board and start developing.

[-] danielhanchen [S]: Interesting - tensor cores on the phone - ye, local AI finetuning does sound pretty sick.

[-] lkraven: e

[-] [deleted]: [deleted]

[-] tompute: They claim faster performance with 0% loss of accuracy. They are not claiming 0% accuracy. There's a difference...

[-] danielhanchen [S]: Thanks for that!
Ye, so there are no approximation methods - all exact computations - we just did some maths and coding trickery :) Oops, maybe I should have worded the title better.

[-] danielhanchen [S]: Also, join our Discord if you wanna chat AI and stuff or learn more about Unsloth! https://discord.gg/AecqJdXGz5

[-] arnott: Maybe off topic: is it possible to finetune on a custom set of documents, and not only on prompts & answers?

[-] FullOf_Bad_Ideas: What kind of output do you want your model to produce? Should it just hallucinate continuations of the documents? If so, yes.

[-] arnott: No hallucinations. Answer questions based on the documents.

[-] FullOf_Bad_Ideas: Then you would need to transform the documents into a question-answer dataset using some LLM. Also, this is not something that works well with LoRA/QLoRA. Even after training on the transformed dataset, you will probably still get a ton of hallucinations; LoRA is more about transferring style than teaching new knowledge. What you want is just RAG.

[-] arnott: Ok, thanks.

[-] ajibawa-2023: Interesting development! I have fully finetuned 17 models but never tried LoRA or QLoRA. I will try it out. Thanks & keep up the good work!

[-] kaszebe: Hi OP, u/danielhanchen. Is there a "guide for complete morons" that would allow a n00b like me to fine-tune with your finetuning? I have a 4090 gaming rig. Also, do I need to provide the system with a ton of source material (e.g. scraped websites), or can I just provide it with a list of instructions that I want it to follow every time it writes something for me (e.g. "don't use passive voice", "write at a college level", etc.)? I'm a writer and I use AI to help me write. Thank you.

[-] dervu: Can anyone help a newbie to AI training: is it worth fine-tuning such a model when I have a single 4090 24GB? I would like to fine-tune it on project code that otherwise shouldn't be leaked to external AIs - on a smaller project first, then a bigger one. Is setting it up, preparing the code, and the time spent training on one GPU worth the hassle, to get answers about this code project and maybe help with alternative approaches to the code?

[-] athirdpath: Thank you so much! Do you intend to add DPO training support?

[-] watkykjynaaier: Was the decision to adopt the Apple-ish Pro/Max product segmentation intentional? Because to me it implies an association with the M chips, and that confused me, especially now that I've seen this won't run on Apple GPUs at all.
If you're still calibrating your product offering, I would strongly suggest renaming it.

[-] ii-___-ii: Great work

[-] oc-homelabber: Just an FYI: the GitHub page links to "https://www.unsloth.ai" and that link doesn't work. I had to go to "https://unsloth.ai" to visit the webpage.