https://old.reddit.com/r/LocalLLaMA/comments/188197j/80_faster_50_less_memory_0_accuracy_loss_llama/

80% faster, 50% less memory, 0% accuracy loss Llama finetuning
Tutorial | Guide (self.LocalLLaMA)
submitted 01 Dec 2023 by danielhanchen

Hey r/LocalLLaMA community!
Just launched our open-source 5x faster finetuning package Unsloth (https://github.com/unslothai/unsloth), where you can finetune Llama models:

* 5x faster
* with 50% less memory
* with 0% loss in accuracy
* all locally on NVIDIA GPUs (Tesla T4, RTX 20/30/40, A100, H100) for free!
* QLoRA / LoRA is now 80% faster to train.

We manually derived the backpropagation steps, wrote all kernels in OpenAI's Triton language, and applied some more maths and coding trickery. You can read more about our tricks at https://unsloth.ai/introducing.

I wrote a Google Colab notebook for a T4 that finetunes Alpaca 2x faster on a single GPU: https://colab.research.google.com/drive/1oW55fBmwzCOrBVX66RcpptL3a99qWBxb?usp=sharing

On Kaggle, via 2 Tesla T4s with DDP (https://www.kaggle.com/danielhanchen/unsloth-laion-chip2-kaggle), you can finetune LAION's OIG 5x faster and Slim Orca 5x faster - Slim Orca goes from 1301 hours down to 260 hours.

You can install Unsloth locally via:

    pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
    pip install "unsloth[cu121] @ git+https://github.com/unslothai/unsloth.git"

Currently we only support PyTorch 2.1 and Linux distros - more installation instructions are at https://github.com/unslothai/unsloth/blob/main/README.md

We hope to:
1. Support LLMs other than Llama-style models.
2. Add sqrt gradient checkpointing to shave another 25% off memory usage.
3. And other tricks!
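To make the workflow concrete, here is a minimal QLoRA finetuning sketch in the spirit of the linked Colab notebook. The class names (FastLanguageModel, get_peft_model), the checkpoint name and the dataset/column choices below are assumptions based on the repo's README conventions rather than verbatim Colab code, so defer to the notebook and README for exact usage.

    # Hypothetical sketch of an Unsloth QLoRA finetune with TRL's SFTTrainer.
    # Names marked "assumption" are illustrative; check the linked Colab/README.
    from unsloth import FastLanguageModel          # assumption: current README class name
    from trl import SFTTrainer
    from transformers import TrainingArguments
    from datasets import load_dataset

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-2-7b",           # assumption: any Llama-architecture checkpoint
        max_seq_length=2048,
        load_in_4bit=True,                         # QLoRA: 4-bit base weights
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,                                      # LoRA rank
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=16,
    )

    # Assumes the dataset has been mapped to a single "text" column
    # (see the Alpaca formatting sketch further down the thread).
    dataset = load_dataset("yahma/alpaca-cleaned", split="train")

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        args=TrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            max_steps=60,
            learning_rate=2e-4,
            fp16=True,
            output_dir="outputs",
        ),
    )
    trainer.train()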
all 185 comments

[-] 21022018: How does this compare to QLoRA or LoRA?

[-] danielhanchen [S]: Oh, it makes QLoRA 80% faster! So if you already use QLoRA, this makes it faster. I also support LoRA, which is also sped up - a bit less of a speedup though. I edited the post to mention it :)

[-] mcmoose1900: Does it reduce VRAM usage much? Also, either way, this is super cool and awesome. It's insane that everyone is training Llama in eager mode. I'm looking forward to the planned DPO training as well.

[-] Kindly-Abroad-3781: Thank you so much for this awesome open-source work! From what I gather on your blog, all the improvements are a result of manual autograd and switching all the kernels to OpenAI's Triton kernels, right?

[-] danielhanchen [S]: https://unsloth.ai/introducing has more details on the manual autograd methods and the Triton kernels, plus other coding tricks like in-place operations, reduced memory movements, etc.

[-] Kindly-Abroad-3781: Awesome, looking forward to the new blog!

[-] danielhanchen [S]: :)

[-] danielhanchen [S]: I might write a full blog about all the changes we made if you're interested.

[-] Kindly-Abroad-3781: I just had a quick look at the source code of Unsloth, and surprisingly, even though the open version already implements acceleration strategies like Flash Attention, the Max and Pro versions of Unsloth can boost training speed by more than 5 times. If possible, I'm really looking forward to learning about the strategies used in the Max/Pro versions.

[-] danielhanchen [S]: Oh ye, you can boost it even further with more maths and coding hacks!

[-] Relevant_Outcome_726: Can we use this for fine-tuning Mistral?

[-] danielhanchen [S]: Currently no - I will push some changes to allow it in a few days. Technically Mistral's model arch is the same as Llama, so it should be an easy change - I'll msg you once it's done.

[-] OnY86: Nice to hear! Message me too please, thanks!

[-] danielhanchen [S]: Cool!

[-] BayesMind: Mistral sounds great, thank you! I'll sub to your repo for updates!

[-] danielhanchen [S]: :)

[-] UserMinusOne: Nice to hear! Message me too please, thanks!

[-] danielhanchen [S]: Yep!

[-] Paulonemillionand3: I'm also looking for exactly that!

[-] danielhanchen [S]: :)

[-] DickMasterGeneral: Me too, if you don't mind.

[-] danielhanchen [S]: :)

[-] Tiny_Arugula_5648: Me too!

[-] danielhanchen [S]: Cool!!
[-] Square-Tooth2635: !RemindMe 7days

[-] silenceimpaired: Never trained... wish you had a "so you've never trained" guide :)

[-] danielhanchen [S]: Oh, so like a full step-by-step guide on training on a dataset - even the dataset prep stage etc.?

[-] silenceimpaired: Yup. I know the dataset could just be a plain text file... but people see JSON all the time and aren't sure what to make of that, or how to get started. A simple walkthrough encourages people to explore the scary alien terrain :)

[-] danielhanchen [S]: Oh interesting - I'll write up an example - I'll ping you once it's done!

[-] thewayupisdown: Me too, please and thank you!

[-] potatodioxide: Me too please!

[-] pmp22: Ping me too please! Also, a small table with model sizes and hardware requirements would be nice, to get a ballpark for what hardware is needed. Say I have a 4090 - what can I fine-tune with that and how long will it take?

[-] letchhausen: Me, too, please!

[-] Koliham: A guide to training with example datasets, so that we don't make mistakes in the instruct format, would be great.

[-] danielhanchen [S]: I have some Colab notebooks - https://colab.research.google.com/drive/1oW55fBmwzCOrBVX66RcpptL3a99qWBxb?usp=sharing for Alpaca. I can make more for other datasets if that works - do you have any suggestions?

[-] psdwizzard: I would really like to see that too.

[-] danielhanchen [S]: Coolies!

[-] jwyer: That would be great and would help out a lot of people, making LLMs even more accessible.

[-] danielhanchen [S]: I'll make one!!
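For the instruct-format question above: Alpaca-style data keeps three columns (instruction, input, output), and the finetuning notebooks simply render them into one prompt string before tokenisation. A small sketch - the template text follows the usual Stanford Alpaca wording, and collapsing everything into a single "text" column is an assumption that matches what SFT trainers typically expect:

    # Sketch of Alpaca-style instruct formatting (standard Alpaca column names assumed).
    ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

    ### Instruction:
    {instruction}

    ### Input:
    {input}

    ### Response:
    {output}"""

    def format_example(example: dict) -> dict:
        # Collapse the three Alpaca columns into a single "text" field for SFT.
        return {"text": ALPACA_PROMPT.format(**example)}

    # dataset = dataset.map(format_example)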
[-] azriel777: We really need a "train your first AI model for dummies" book.

[-] Aaaaaaaaaeeeee: Currently, I can finetune a 34B on 24GB at a maximum of 192 ctx at rank 8 with the Hugging Face model at 4-bit. I have a feeling the HF 4-bit model is too large - is this able to shrink that size, or just the excess after model loading? And if I used a smaller-bpw GPTQ model, could I still use the library?

[-] danielhanchen [S]: Haven't tried it on 34B yet, but it should also reduce memory usage by 50%, i.e. your batches can be approximately 6x larger according to our matrix size calculations. But essentially we still load the model as 4-bit, then do all the memory shrinking during the training process.

[-] Aaaaaaaaaeeeee: You legend. Thanks for sharing these optimizations!

[-] danielhanchen [S]: Thanks!

[-] FullOf_Bad_Ideas: I am fine-tuning Yi-34B on a 24GB 3090 Ti with ctx size 1200 using axolotl. If you want some tips and tricks with it, I can help you get up to what I am getting. I haven't tried Unsloth yet but I am a touch sceptical.

[-] danielhanchen [S]: Oh, I'm not sure if Yi is supported - I heard it's just Llama's arch, so I'll make it work. Axolotl is cool though!

[-] FullOf_Bad_Ideas: I am fine-tuning on the llama-fied Yi-34B: https://huggingface.co/chargoddard/Yi-34B-Llama/tree/llama-tokenizer It's the same structure as Llama, so unless someone hardcoded parameters like the number of heads, layers, hidden sizes and all of those magic numbers, software that supports Llama 1 33B should also support Yi-34B-Llama without any patches.

[-] danielhanchen [S]: Oh wait, it's also "LlamaForCausalLM" - it should work then. I just haven't fully verified whether grouped-query attention works as expected - hopefully my handling of it works.

[-] FullOf_Bad_Ideas: If it's possible to use Unsloth to train a 34B model with QLoRA at a context length of 4096 on a 24GB GPU, it would be a big deal.

[-] danielhanchen [S]: Probably? I haven't tried it out lol. I'll probably run it on an A100 instance via Colab and see the peak memory usage. I think 4096 is fine, since at 2048 for 7B the max batch size I found to work was around 14!!

[-] Aaaaaaaaaeeeee: I would appreciate it! You could share a config, or maybe make a post with tips for other lone 3090s to replicate. I don't have my setup fully optimized because I still use 0.5-0.6GB for my monitor.
[-] FullOf_Bad_Ideas: Config is here: https://huggingface.co/adamo1139/Yi-34B-Spicyboros-2-2-run3-QLoRA/tree/main/config The secret sauce is to enable Flash Attention and disable sample packing. Something like 1400-1700 ctx should be achievable if you run the PC without a monitor or use the iGPU. I saved 10 bucks buying an Intel CPU with the iGPU fused off and it's biting me in the ass now.

[-] EntertainmentBroad43: How's the memory consumption compared to QLoRA?

[-] danielhanchen [S]: Apologies, I didn't mention it - the "80% faster" is making QLoRA / LoRA itself 80% faster and use 50% less memory. So on the Open Assistant dataset, memory usage via QLoRA is shaved from 14GB to 7.8GB at bsz = 2, ga = 4. You can now fit even larger batches via QLoRA.

[-] ExtensionCricket6501: Interesting - any estimates for the minimum VRAM requirement to train the Llama variants now (7B, 13B, 34B, 70B)? Seems like VRAM already drops a lot on Open Assistant alone.

[-] danielhanchen [S]: Oh yes, it depends on the dataset - for example Alpaca takes 6.8GB of VRAM at batch size = 2. If you do bsz = 1 it'll be even less - I haven't tested it yet. On OASST, VRAM is reduced from 14GB to 7.8GB. For 13B I don't have the numbers, but also a 50% reduction. On 34B and 70B I sadly haven't tested yet - will do so - but presumably again a 50% reduction.

[-] g3t0nmyl3v3l: Dude, that is insane. Amazing work, you rock!

[-] danielhanchen [S]: Thanks a bunch!!!

[-] CjqM8012: Nice work! I am yet to go through the blog, but could any of these optimizations be applied to inference as well?

[-] danielhanchen [S]: Thanks! Yep - working on inference now!!

[-] Tasty-Lobster-8915: I would like to try this! Can you give an example of a full tune script?

[-] danielhanchen [S]: Thanks! We have complete examples via Google Colab: https://colab.research.google.com/drive/1oW55fBmwzCOrBVX66RcpptL3a99qWBxb?usp=sharing for Alpaca, and LAION's OIG via Kaggle on 2 GPUs: https://www.kaggle.com/danielhanchen/unsloth-laion-chip2-kaggle Both are free to run!
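As a rough sanity check on the VRAM figures quoted a few comments up, here is some napkin maths; every number below is an assumed round figure (activation and overhead sizes depend heavily on batch size, sequence length and checkpointing), not Unsloth's internal accounting:

    # Napkin maths for a 7B QLoRA run; all component sizes are rough assumptions.
    params = 7e9
    weights_4bit = params * 0.5 / 1e9      # ~3.5 GB: NF4 base weights at ~4 bits/param
    quant_overhead = 1.0                   # assumed: quant scales, fp16 norms/embeddings
    lora_states = 0.3                      # assumed: r=16 adapters + grads + Adam states
    activations = 2.5                      # assumed: activations at bsz=2, seq=2048
    total = weights_4bit + quant_overhead + lora_states + activations
    print(f"~{total:.1f} GB")              # ~7.3 GB, in the ballpark of the 6.8-7.8 GB quoted above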
[-] Tasty-Lobster-8915: Thanks for those. In both of the links you sent, I see the LoRA rank and targets are set during initialisation. Do you have an example of how to run a full finetune of all parameters (non-LoRA)?

[-] danielhanchen [S]: Ohhh, a full finetune - currently that's sadly not supported - only QLoRA for now, sorry.

[-] Tasty-Lobster-8915: Ahh... any plans for support in the future?

[-] danielhanchen [S]: Technically yes, but sadly, since my bro and I are fully bootstrapping this as a startup, we decided to push it into our Pro and Max plans - we're still not sure how to monetize it yet - as a platform? Sell the code? Etc.

[-] OVAWARE: Well, you should start with donations - it's not much, but it's easy to set up and can help you get started. Then maybe you could sell an API service for training?

[-] danielhanchen [S]: Ye, good point!! I'll ask my bro about this - thanks so much for the help!

[-] Tasty-Lobster-8915: I'm still potentially interested depending on your price point! Looking forward to when your "Pro" and "Max" versions release!

[-] danielhanchen [S]: :)! Having discussions on pricing and stuff - just not sure how we're gonna approach it - if you have any pricing ranges you feel would be right, that'd be sick!

[-] Crafty-Run-6559: Sell easy QLoRAs for $ per hour. Make it simple: upload your training data (better yet, provide a bunch of different sets to use), tune your settings/hyperparameters, and wait for an emailed link to your QLoRA. People will pay for that, and it's recurring revenue. If you release the core training code like you have, it makes it easy for people to trust it. Just start releasing it under the same license as Mongo, or AGPL.

[-] danielhanchen [S]: Ye, a finetuning platform! One issue I'm still figuring out is somehow integrating GPUs via AWS / Google Cloud - I was trying to, say, hook up Colab internally to run it, since we found Colab to be the cheapest.

[-] Crafty-Run-6559: Could always start off with some used 4090s or 3090s lol. It's background batch processing with relatively low bandwidth requirements.

[-] danielhanchen [S]: Yeee, I thought about that - it's not a bad point, I guess - thanks for the ideas - I'll chat with my bro more about this! Appreciate it!

[-] SmolGnoll: You will get hired on the back of this. Advertise your contacts, publish a paper. Also, I am very interested in whether these optimisations can be applied to full fine-tunes.
[-] bot-333: What's the reason that this is faster? Custom kernels?

[-] danielhanchen [S]: Custom kernels in Triton, Flash Attention, in-place ops, manual derivation of matrix differentials, chained matrix bracketing, reduced data movement and more!!! https://unsloth.ai/introducing has more details :) I'll write up a full blog post if you're interested!

[-] bot-333: That would be appreciated! I wonder if they could integrate these into BnB - that could be very fast LOL. I guess there's ExLlamaV2.

[-] danielhanchen [S]: Oh ye, that would be cool! I'll talk with Tim Dettmers from BnB about it!

[-] bot-333: Or maybe integrate it into Transformers itself and/or PEFT/Trainer? That would be huge.

[-] danielhanchen [S]: Ye, good point - I'll see what I can do with my bro! :)

[-] bot-333: Also, can you share more information on Unsloth Pro and Max?

[-] danielhanchen [S]: Ye, so Pro makes training even faster - from 5x to roughly 28x - and supports multi-GPU training. Max further speeds it up to 31x, but the difference is that Max makes it possible to work on Intel and AMD GPUs, and supports full finetuning and training.

[-] bot-333: That sounds nice - can you provide detail on the further optimizations? Or is that a secret sauce?

[-] danielhanchen [S]: So our blog https://unsloth.ai/introducing has a bit more - but for the Pro and Max versions, that's our specialty! :) If you're interested, I'll write a detailed blog post about all the changes we made in the open-source version.

[-] bot-333: Sorry for the multiple comments like this, but maybe CUDA kernels would be faster?

[-] danielhanchen [S]: I found CUDA kernels to be faster for non-jitted code - i.e. if you run kernels only once or twice, since there's a JIT compilation cost with Triton. In general, CUDA and Triton are equal in terms of speed - Triton more so, since you can try out more hypotheses.

[-] bot-333: Interesting, thanks.

[-] danielhanchen [S]: :)

[-] Kgcdc: Will this work on my SMC box with 10 L40S?
Happy to give you access to test if needed.

[-] danielhanchen [S]: Hey! I was just about to test it via Google Cloud's L40 instances! So via DDP (Deepspeed is still in the works), our other offerings, Pro and Max, support it. I'm bootstrapping this as a startup with my brother, so sadly we decided to make it a paid component to cover our living expenses. Can chat if you're interested!

[-] Kgcdc: I have L40S, not L40. But let's chat, since we are looking for the right inference server.

[-] Techyogi: Any chance of Apple silicon support??

[-] danielhanchen [S]: Currently no, sadly - I don't know how to write Apple kernels. But technically, because everything is written in Triton, it should work for AMD and Intel GPUs as well. On CPUs - maybe in the future via BLAS and C++ code, if people are interested.

[-] CanIstealYourDog: I'm fine-tuning Llama 2 7B using QLoRA on an Nvidia A6000. Would this work for that?

[-] Aaaaaaaaaeeeee: From the post: "All locally on NVIDIA GPUs (Tesla T4, RTX 20/30/40, A100, H100s)" - Ampere is supported.

[-] danielhanchen [S]: Thanks! Yep, Ampere, Hopper, etc.! Oops, maybe I should have written that.

[-] danielhanchen [S]: Yep!

[-] ambientswan: Thank you for your work! Any chance of this supporting Apple Silicon / Metal?

[-] danielhanchen [S]: Thanks! Yep, maybe in the future!

[-] iLaurens: The pricing page for Unsloth Pro has this header: "Unlock our 30x faster algorithm for multiple GPUs". But then in the bullets below it says "single GPU only". So what's the deal with Pro? Is it single- or multi-GPU training?

[-] danielhanchen [S]: OHH ye, we're still figuring it out as we go along - much apologies. After discussions with people and my bro, Pro will in fact support multi-GPU, most likely priced like a video game for hobbyists. The issue is we didn't expect the Pro/Max to get interest - our goal was to first showcase the OSS one, so we didn't really plan for Pro/Max yet. I'll update the details once it's all confirmed.

[-] stormer0: The talent posting here is pretty insane. Blows my mind how quickly people are iterating on this. Thank god for open source.

[-] tgredditfc: Awesome!
I really need to reduce VRAM usage, as I need to train with a cutoff length of 2048, which costs tremendous VRAM! Can I run it in WSL?

[-] danielhanchen [S]: It would be fabulous if you could report back on whether it works - I can also help debug the installation if that helps.

[-] tgredditfc: Will do!

[-] danielhanchen [S]: WSL should work, hopefully? I'm not 100% sure - I haven't tried it - but hopefully it works.

[-] wishtrepreneur: When do you have Mistral finetuning planned?

[-] danielhanchen [S]: In the next few days - I'll ping you!

[-] No-Link-2778: What about DeepSpeed ZeRO offload?

[-] danielhanchen [S]: I haven't tested Deepspeed yet - will do in the next few days - but DDP works great on the Pro / Max code paths. The open-source version will sadly segfault on multiple GPUs, since the code mechanisms are different - you still get a 5x speed boost though, with all our tricks!

[-] Calandiel: Could you consider adding axolotl to the comparison graph?

[-] danielhanchen [S]: Will do!

[-] iamMess: And share the config used.

[-] danielhanchen [S]: Yep!

[-] TheEasternContrarian: Love not just the package but the comprehensive, well-documented examples already! I have a more individual question, if you don't mind: what suggestion would you give someone who's getting started with writing custom kernels (CUDA or Triton)?

[-] danielhanchen [S]: Thanks! Oh, Triton has some cool docs / tutorials which I used extensively for Unsloth - https://triton-lang.org/main/getting-started/tutorials/index.html - also our kernels at https://github.com/unslothai/unsloth/tree/main/unsloth/kernels have tonnes of comments, and I tried my best to make them super readable.

[-] TheEasternContrarian: Thank you. The kernel comments are quite clear and intuitive! It looks like, to get started, I would really have to know the maths transformations and the process, and then using the DSL is just a matter of reading the docs and moving the blocks?
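For anyone following the custom-kernel exchange above: the Triton tutorials linked there start from roughly the following vector-add kernel (this is the canonical tutorial example, not Unsloth's code). Unsloth's kernels follow the same shape - a @triton.jit function launched over a grid of blocks - just with the RoPE / RMSNorm / cross-entropy maths inside.

    # Minimal Triton kernel in the style of the official tutorials: elementwise add.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                  # which block this program instance handles
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                  # guard the ragged last block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = (triton.cdiv(n, 1024),)               # one program per 1024-element block
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out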
[-] Tough-Sound-6985: Would inference speed also improve with the new kernels?

[-] danielhanchen [S]: Yes, buttt some kernels don't work yet, since it's optimized for training only - and inference has even more tricks you can use!! I'll see if I can push changes in the coming days!

[-] Danny_Davitoe: Will this work on CPU-only machines?

[-] danielhanchen [S]: I'm working on making CPU training work as well! But currently it's GPU only.

[-] reallmconnoisseur: Look who became aware of your work :)

[-] VectorD: How come you use max_seq_length = 2048 instead of 4096 in the Colab notebook?

[-] danielhanchen [S]: Oh, you can change it to 4096 - up to you - I just chose 2048. I think bsz = 2 still works. The savings are still as described - maybe even more at larger sequence lengths.

[-] VectorD: I see, thanks. I see that multi-GPU is paygated; I am a hobbyist with a 4x 4090 rig for ML. How much do you charge for Unsloth Pro/Max?

[-] danielhanchen [S]: We're working on the pricing plan as we speak!! Sorry, everything is very hectic, so it's still on the drawing board.

[-] iCTMSBICFYBitch: This is incredible. Well done and thank you!

[-] danielhanchen [S]: Thanks!

[-] bymihaj: https://unsloth.ai/introducing mentions AMD GPUs. What is the status? Will inference be available?

[-] danielhanchen [S]: Ye, AMD and Intel via Triton - we Tritonized all kernels, so in theory it should work - even the bitsandbytes 4-bit step is in Triton. I still need to verify whether the Flash Attention kernels via Triton work or not.

[-] FullOf_Bad_Ideas: I never trained with Hugging Face, so that comparison is not very clear to me. Is it faster than QLoRA with axolotl, Flash Attention 2 enabled and sample_packing disabled? If you claim to use 50% less memory than QLoRA, that would mean training a model such as NF4 Llama 2 7B would use about 4GB of GPU memory, which is almost less than the quantized weights of the model itself. Is that the case? Call me sceptical, but you have to be when someone is promoting their paid product.

[-] danielhanchen [S]: Yep, it's still faster than axolotl with FA2 = True and packing = False - I'll provide some benchmarks later - the performance benefit is smaller, though, since FA2 already shaves a chunk off the running time.
Oh no - so 7B will use 7.8GB of VRAM on OASST: the weights take 4.8GB or so, whilst LoRA and the gradients take 3GB. Other datasets are closer to the 50% reduction in training memory usage. Apologies if it seemed like I was promoting a paid product - technically we don't even have a price, as we're very new to this. The issue is that in the past I also released some faster training methods, and they were eaten up by big corpos; we wanted to give the most to the OSS community, hence the gating of some aspects of the code. We're still figuring out our pricing plans.

[-] LJRE_auteur: Christmas has lasted the entire year for AI enthusiasts x). I can't wait for this to be implemented for Windows and/or UIs for LLMs.

[-] danielhanchen [S]: Working on it! Windows - we're trying to see if it can somehow be supported!

[-] deck4242: Good stuff!

[-] danielhanchen [S]: Thanks!

[-] Paulonemillionand3: Fantastic work. I was previously able to use llama-recipes to tune 13B, but recent updates cause it to run out of memory. Hopefully this allows that (2x 3090).

[-] danielhanchen [S]: Tell me how it goes!!!

[-] topiga: Is it possible to convert the result to GGUF afterwards? Also, do you have any examples for Mistral?

[-] danielhanchen [S]: GGML maybe in the future :) Mistral today / tomorrow!!!

[-] topiga: Nice! Thanks

[-] CasimirsBlake: For those of us who would just like to try a model that's been put through this fine-tuning, it'd be nice if folks could upload some to Hugging Face... Any chance of GGUF models? P40s would benefit so much from these improvements. Or does this not make inference any faster yet?

[-] danielhanchen [S]: Currently it works for training - inference is in the works! GGML - I'll see if we can support it!

[-] evilnebster: Does it work with P40s, then? Above, you only mentioned Nvidia Turing and later.

[-] Timotheeee1: Have you also tried the Sophia optimizer?

[-] danielhanchen [S]: No, I haven't yet - but I will try to! I think I read about it on the Machine Learning subreddit or somewhere - I'll report back!

[-] hprnvx: Will it work with a 1060 6GB?
[-] danielhanchen [S]: Oh my, probably not - the lowest we support is probably CUDA compute capability 7.5. Sadly, compute capability 6 is just off the mark.

[-] hprnvx: Ok, thx anyway :)

[-] cyryscyn: Awesome

[-] danielhanchen [S]: Thanks!! Hope you can try it out!! :)

[-] BoneDaddyMan: The sample on GitHub says the context is 2048. Can it finetune with 4096 context? Is this for Llama 2?

[-] danielhanchen [S]: You can change it to whatever you like! :) Yep, Llama 2.

[-] Woof9000: ngl, this is a very sexy post

[-] danielhanchen [S]: :)

[-] bash99Ben: Will it support the V100 32G GPU?

[-] danielhanchen [S]: It does already! :)))

[-] You_Wen_AzzHu: Holy shit, this is groundbreaking.

[-] danielhanchen [S]: :)

[-] wind_dude: Wow, the stats sound impressive - I'll have to try this on my next training run!

[-] danielhanchen [S]: Thanks!

[-] nntb: I can't wait until people start talking about Snapdragon support, like the Snapdragon 8, which actually has tensor cores in its AI elements, allowing phones with it to start doing local AI. There's already one project I know of that lets you do it, but it would be great to see other people get on board and start developing.

[-] danielhanchen [S]: Interesting - tensor cores on the phone - ye, local AI finetuning does sound pretty sick.

[-] lkraven: e

[-] [deleted]: [deleted]

[-] tompute: They claim faster performance with 0% loss of accuracy. They are not claiming 0% accuracy. There's a difference...

[-] danielhanchen [S]: Thanks for that!
Ye, so there are no approximation methods - all exact computations - we just did some maths and coding trickery :) Oops, maybe I should have worded the title better.

[-] danielhanchen [S]: Also, join our Discord if you wanna chat AI and stuff or learn more about Unsloth! https://discord.gg/AecqJdXGz5

[-] arnott: Maybe off topic: is it possible to finetune on a custom set of documents, and not only on prompts & answers?

[-] FullOf_Bad_Ideas: What kind of output do you want your model to produce? Should it just hallucinate continuations of the documents? If so, yes.

[-] arnott: No hallucinations. Answer questions based on the documents.

[-] FullOf_Bad_Ideas: Then you would need to transform the documents into a question-answer dataset using some LLM. Also, this is not something that works well with LoRA/QLoRA. Even after training on the transformed dataset, you will probably still get a ton of hallucinations; LoRA is more about transferring style than teaching new knowledge. What you want is just RAG.

[-] arnott: Ok, thanks.

[-] ajibawa-2023: Interesting development! I have fully finetuned 17 models but never tried LoRA or QLoRA. I will try it out. Thanks & keep up the good work!

[-] kaszebe: Hi OP, u/danielhanchen. Is there a "guide for complete morons" that would allow a n00b like me to fine-tune with your finetuning? I have a 4090 gaming rig. Also, do I need to provide the system with a ton of source material (e.g. scraped websites), or can I just provide it with a list of instructions that I want it to follow every time it writes something for me (e.g. "don't use passive voice", "write at a college level", etc.)? I'm a writer and I use AI to help me write. Thank you.

[-] dervu: Can anyone help a newbie to AI training: is it worth fine-tuning such a model when I have a single 4090 24GB? I would like to fine-tune it on project code that otherwise shouldn't be leaked to external AIs - on a smaller project first, then a bigger one. Is setting it up, preparing the code, and the time spent training on one GPU worth the hassle, to get answers about this code project and maybe help with alternative approaches to the code?

[-] athirdpath: Thank you so much! Do you intend to add DPO training support?

[-] watkykjynaaier: Was the decision to adopt the Apple-ish Pro/Max product segmentation intentional? Because to me it implies an association with the M chips, and that confused me, especially now that I've seen this won't run on Apple GPUs at all.
If you're still calibrating your product offering, I would strongly suggest renaming it.

[-] ii-___-ii: Great work

[-] oc-homelabber: Just an FYI: the GitHub page links to "https://www.unsloth.ai" and that link doesn't work. I had to go to "https://unsloth.ai" to visit the webpage.