[HN Gopher] Fine-Tuning Llama-2: A Comprehensive Case Study for ...
       ___________________________________________________________________
        
       Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring
       Custom Models
        
       Author : robertnishihara
       Score  : 190 points
       Date   : 2023-08-11 16:34 UTC (6 hours ago)
        
 (HTM) web link (www.anyscale.com)
 (TXT) w3m dump (www.anyscale.com)
        
       | praveenhm wrote:
        | Is it possible to fine-tune Llama-2 locally on an M1 Ultra with
        | 64GB? I would like to know; any pointer would be good. Most of
        | them are on the cloud or using Nvidia CUDA on Linux.
        
       | ilaksh wrote:
       | One challenge is that to get large enough custom datasets you
       | either need a small army or a very strong existing model. Which
       | means that you probably have to use OpenAI. And using OpenAI to
       | generate training material for another model violates their
       | terms.
       | 
       | Has anyone taken them to court about this? Do we all just decide
       | it's not fair and ignore it?
        
         | bugglebeetle wrote:
         | This is not true for all tasks. For many NLP tasks, you just
         | need to reformat existing data to match the LLM format.
        
         | sillysaurusx wrote:
         | Why not ignore ToS? The worst that can happen is that you lose
         | access.
        
           | charcircuit wrote:
           | The worst that can happen is you get brought into an
           | expensive lawsuit.
        
       | jawerty wrote:
        | Just to add to this, I ran through a lot of these topics around
        | fine-tuning Llama 2 on your own dataset (for me it's my own
        | code :P) in a coding live stream a couple weeks ago. All on a
        | single Colab GPU.
       | 
       | Fine-tuning Llama stream:
       | https://www.youtube.com/watch?v=TYgtG2Th6fI&t=2282s
       | 
        | I have a couple more where I do a QLoRA fine-tuning session and
        | explain the concepts as a self-taught engineer (a software
        | engineer of 8 years moving into ML recently).
       | 
        | QLoRA fine-tuning stream:
       | https://www.youtube.com/watch?v=LitybCiLhSc&t=4584s
       | 
        | Overall I'm trying to break down how I'm approaching a lot of
        | my personal projects and my current AI-driven startup. I want
        | to make this information as accessible as possible. I also have
        | a series where I'm fine-tuning a model to be the smallest
        | webdev LLM possible, which people seem to be liking. I've only
        | been streaming for about a month and there's plenty more to
        | come.
       | 
        | Ask me any question about the streams and fine-tuning Llama!
        
         | purplecats wrote:
          | We really need a simple "put your source stuff in this
          | directory, then press this button, then chat with your
          | contents" type app/module/library.
          | 
          | Too much implementation detail is required, making it
          | inaccessible for any non-significant use case. I imagine
          | privateGPT will get there slowly.
        
           | zora_goron wrote:
            | I wrote a simple implementation to do this in ChatGPT via a
            | local plugin [0]. Obviously it doesn't hit the "fully
           | private" requirement but I imagine it would be relatively
           | straightforward to integrate into a local LLM. The question
           | is whether a local LLM would be as good at grabbing enough
           | context and nuance from the project to answer meaningfully as
           | GPT-4 is able to do with plugins.
           | 
           | [0] https://github.com/samrawal/chatgpt-localfiles
        
           | jawerty wrote:
            | In one of my streams I essentially build this from scratch:
            | https://www.youtube.com/watch?v=kBB1A2ot-Bw&t=236s. It's a
            | retriever-reader model. Let me know if you want the code; I
            | think I link the Colab in the comments, but let me know if
            | you need more.
        
         | SubiculumCode wrote:
          | One GPU? Feasible with one 3060?
        
           | nacs wrote:
            | Absolutely. For QLoRA / 4-bit / GPTQ fine-tuning, you can
            | train a 7B easily on an RTX 3060 (12GB VRAM).
            | 
            | If you have a 24GB VRAM GPU like an RTX 3090/4090, you can
            | QLoRA fine-tune a 13B or even a 30B model (in a few hours).
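            | 
            | A minimal QLoRA sketch with Hugging Face transformers,
            | peft, trl and bitsandbytes (the model name, data file and
            | hyperparameters are illustrative, not from this thread):
            | 
            |   import torch
            |   from datasets import load_dataset
            |   from transformers import (
            |       AutoModelForCausalLM, AutoTokenizer,
            |       BitsAndBytesConfig, TrainingArguments)
            |   from peft import (
            |       LoraConfig, prepare_model_for_kbit_training)
            |   from trl import SFTTrainer
            |
            |   base = "meta-llama/Llama-2-7b-hf"  # gated repo
            |   bnb = BitsAndBytesConfig(
            |       load_in_4bit=True,
            |       bnb_4bit_quant_type="nf4",
            |       bnb_4bit_compute_dtype=torch.bfloat16)
            |   tok = AutoTokenizer.from_pretrained(base)
            |   tok.pad_token = tok.eos_token  # Llama has no pad token
            |   model = AutoModelForCausalLM.from_pretrained(
            |       base, quantization_config=bnb,
            |       device_map="auto")
            |   model = prepare_model_for_kbit_training(model)
            |
            |   # low-rank adapters on the attention projections
            |   lora = LoraConfig(
            |       r=16, lora_alpha=32, lora_dropout=0.05,
            |       target_modules=["q_proj", "v_proj"],
            |       task_type="CAUSAL_LM")
            |
            |   # hypothetical JSONL with a pre-formatted "text" column
            |   ds = load_dataset("json", data_files="train.jsonl",
            |                     split="train")
            |
            |   trainer = SFTTrainer(
            |       model=model, train_dataset=ds,
            |       peft_config=lora, tokenizer=tok,
            |       dataset_text_field="text", max_seq_length=512,
            |       args=TrainingArguments(
            |           output_dir="qlora-out", num_train_epochs=1,
            |           per_device_train_batch_size=1,
            |           gradient_accumulation_steps=16,
            |           learning_rate=2e-4, bf16=True))
            |   trainer.train()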
        
             | jawerty wrote:
             | +1 this
        
             | kouroshh wrote:
              | It would be good to see a rigorous analysis of the effect
              | of these PEFT methods on quality. There still seems to be
              | a debate on whether these methods sacrifice quality or
              | not.
        
         | sandGorgon wrote:
          | This is brilliant. Could you do a series about how to prepare
          | custom datasets for fine-tuning? That's the part that a lot
          | of other tutorials skip, especially for different goals like
          | safety, accuracy, etc.
        
           | jawerty wrote:
            | Of course. I have a few where I web scrape and build a
            | dataset for myself with prefix tokens. I can break that
            | down more in a specific stream about it.
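            | 
            | A tiny sketch of what that dataset prep can look like (the
            | prefix token, fields and file name are illustrative, not
            | jawerty's actual pipeline):
            | 
            |   import json
            |
            |   # tag each scraped example with a task prefix
            |   # token so the model can condition on it later
            |   examples = [
            |       {"prefix": "<summarize>",
            |        "input": "Long scraped article text ...",
            |        "output": "Two-sentence summary ..."},
            |   ]
            |   with open("train.jsonl", "w") as f:
            |       for ex in examples:
            |           text = (ex["prefix"] + " " + ex["input"]
            |                   + "\n### Response:\n"
            |                   + ex["output"])
            |           row = json.dumps({"text": text})
            |           f.write(row + "\n")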
        
         | SOLAR_FIELDS wrote:
         | What is the general thought process on when it makes sense to
         | use RAG vs fine tuning?
         | 
          | How does segmenting fine-tuned models make sense? Do I need a
          | Terraform LLM, a SQL LLM, and a Python LLM, or can I just use
          | a "code" LLM?
        
           | DebtDeflation wrote:
            | Fine-tuning is for training the model to perform a new
            | task; RAG is for adding knowledge.
            | 
            | In your example, you would fine-tune the model to train it
            | to code in a language it hasn't seen before; RAG will not
            | really help with that.
        
           | jawerty wrote:
            | I have a RAG video (my "make a ChatGPT with podcasts"
            | video) you might be interested in. Semantic search is
            | incredible, and you might be surprised how good a Q/A
            | solution can be just from extracting passages that answer
            | the question.
            | 
            | Overall it depends on whether or not you can turn your data
            | into a fine-tuning dataset, and whether you can find a
            | model with few enough parameters that can use your
            | retrieved contexts as input, to host yourself or use
            | inference endpoints. Hosting an LLM is actually not easy,
            | and working in an information retrieval business I'm
            | finding OpenAI isn't terrible compared to the cost of
            | having GPUs for your users across the world.
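            | 
            | A minimal retrieval-only Q/A sketch along those lines,
            | using sentence-transformers (the model name and passages
            | are illustrative):
            | 
            |   from sentence_transformers import (
            |       SentenceTransformer, util)
            |
            |   # embed passages once, then answer questions by
            |   # returning the best-matching passage directly
            |   model = SentenceTransformer("all-MiniLM-L6-v2")
            |   passages = [
            |       "Llama 2 was released by Meta in July 2023.",
            |       "QLoRA trains adapters over a 4-bit base.",
            |   ]
            |   corpus = model.encode(passages,
            |                         convert_to_tensor=True)
            |
            |   q = model.encode("Who released Llama 2?",
            |                    convert_to_tensor=True)
            |   hits = util.semantic_search(q, corpus, top_k=1)
            |   best = hits[0][0]
            |   print(passages[best["corpus_id"]], best["score"])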
        
           | zby wrote:
           | There is an article at the original site about that:
           | https://www.anyscale.com/blog/fine-tuning-is-for-form-not-
           | fa...
           | 
            | Everybody new to this field thinks they need fine-tuning to
            | teach the LLM new facts. I made the same mistake initially;
            | later I published a slightly ranty post on that:
           | https://zzbbyy.substack.com/p/why-you-need-rag-not-
           | finetunin...
        
             | sandGorgon wrote:
              | Quick question: the Gorilla paper talks about fine-tuning
              | for RAG. Do you see this in practice? Can you do fine-
              | tuning that specifically affects RAG?
        
       | 0xDEF wrote:
       | Has anyone had luck with fine-tuning Llama-v2-7b using the paid
       | (EUR11.00) Colab Pro?
        
       | richardliaw wrote:
       | I'm really glad to see a post like this come out. I've seen so
       | many discussions online about customizing models -- this post
       | really does cut through the noise.
       | 
        | I really like the evaluation methodology, and it seems
        | well-written as well.
        
       | behnamoh wrote:
       | > Additionally, while this wasn't an issue for GPT, the Llama
       | chat models would often output hundreds of miscellaneous tokens
       | that were unnecessary for the task, further slowing down their
       | inference time (e.g. "Sure! Happy to help...").
       | 
       | That's the problem I've been facing with Llama 2 as well. It's
       | almost impossible to have it just output the desired text. It
       | will always add something before and after its response. Does
       | anyone know if there's any prompt technique to fix this problem?
        
         | redox99 wrote:
         | Use a better model.
         | 
         | airoboros supports the PLAINFORMAT token "to avoid backticks,
         | explanations, etc. and just print the code".
         | 
         | https://huggingface.co/TheBloke/airoboros-l2-70B-GPT4-2.0-GG...
        
           | mikeravkine wrote:
            | The model card also has prompt formats for context-aware
            | document Q/A and multi-CoT; using those correctly improves
            | performance on such tasks significantly.
        
           | crooked-v wrote:
            | It's not useful for code, but you can see the difference in
            | approach with NovelAI's homegrown Kayra model, which is set
           | up to handle a mix of text completion and instruct
           | functionality. It never includes extraneous prefix/suffix
           | text and will smoothly follow instructions embedded in text
           | without interrupting the text.
        
           | behnamoh wrote:
           | Thanks, I'll give this a try.
           | 
           | I wonder if LLMs will have less reasoning power if they
           | simply return the output. AFAIK, they think by writing their
           | thoughts. So forcing an LLM to just return the goddamn code
           | might limit its reasoning skills, leading to poor code. Is
           | that true?
        
             | dontupvoteme wrote:
              | You can also just parse the text for all valid code
              | blocks and combine them. I have a script which
              | automatically checks the clipboard for this.
             | 
             | There's no reason to handle the LLM side of things, unless
             | you want to try and optimize the amount of tokens which are
             | code vs comments vs explanations and such. (Though you
             | could also just start a new context window with only your
             | code or such)
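              | 
              | A minimal sketch of that parsing step (the clipboard part
              | is left out; the sample reply is made up):
              | 
              |   import re
              |
              |   def code_blocks(text):
              |       # keep only ``` fenced blocks and drop prose
              |       # like "Sure! Happy to help ..."
              |       found = re.findall(
              |           r"```[\w+-]*\n(.*?)```", text, re.DOTALL)
              |       return "\n\n".join(b.strip() for b in found)
              |
              |   reply = ("Sure! Happy to help.\n"
              |            "```python\nprint('hi')\n```\n"
              |            "Hope that helps!")
              |   print(code_blocks(reply))  # -> print('hi')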
        
             | redox99 wrote:
             | Potentially it could have an impact if it omits a high
             | level description before writing the code, although
             | obviously things like "Sure! Happy to help" do not help.
             | 
             | In practice I haven't seen it make too much of a difference
             | _with GPT_. The model can still use comments to express
             | itself.
             | 
             | For non coding tasks, adding "Think step by step" makes a
             | huge difference (versus YOLOing a single word reply).
        
               | behnamoh wrote:
               | > although obviously things like "Sure! Happy to help" do
               | not help.
               | 
               | Yes you're right. I'm mostly concerned with the text that
               | actually "computes" something before the actual code
               | begins. Niceties like "sure! happy to help" don't compute
               | anything.
               | 
                | CoT indeed works. Now I've seen people take it to the
               | extreme by having tree of thoughts, forest of thoughts,
               | etc. but I'm not sure how much "reasoning" we can extract
               | from a model that is obviously limited in terms of
               | knowledge and intelligence. CoT already gets us to 80% of
               | the way. With some tweaks it can get even better.
               | 
               | I've also seen simulation methods where GPT "agents" talk
               | to each other to form better ideas about a subject. But
               | then again, it's like trying to achieve _perpetual
                | motion_ in physics. One can't get more intelligence from
               | a system than one puts in the system.
        
               | kaibee wrote:
               | > But then again, it's like trying to achieve perpetual
               | motion in physics. One can't get more intelligence from a
               | system than one puts in the system.
               | 
                | Not necessarily the same thing, as you're still putting
                | in more processing power/checking more possible paths.
                | It's kinda like simulated annealing: sure, the system
                | is dumb, but as long as checking whether you have a
                | correct answer is cheap, it still narrows down the
                | search space a lot.
        
               | behnamoh wrote:
                | > It's kinda like simulated annealing.
               | 
                | Yeah, I get that. We assume there's X amount of
                | intelligence in the LLM and try different paths to tap
                | into that potential. The more paths are simulated, the
                | closer we get to the LLM's intelligence asymptote. But
                | then that's it; we can't go any further.
        
         | kouroshh wrote:
          | The Llama-2-chat models have been overly fine-tuned to behave
          | like this. You can give few-shot prompting a try, but it
          | still doesn't guarantee the desired output. The best way to
          | guarantee it is to fine-tune on a small (~1k) set of data
          | points and go from there.
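          | 
          | A small illustration of the few-shot route, using the
          | Llama-2 chat template (the system prompt and examples are
          | made up; it helps often, not always):
          | 
          |   SYS = ("Answer with the requested text only. "
          |          "No preamble.")
          |   shot_q = ("Extract the city: "
          |             "'I flew to Paris on Monday.'")
          |   shot_a = "Paris"
          |   q = ("Extract the city: "
          |        "'She moved to Tokyo in 2019.'")
          |
          |   # one in-context example answered with no chatter,
          |   # then the real question in the same format
          |   prompt = (
          |       f"<s>[INST] <<SYS>>\n{SYS}\n<</SYS>>\n\n"
          |       f"{shot_q} [/INST] {shot_a} </s>"
          |       f"<s>[INST] {q} [/INST]")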
        
       | spdustin wrote:
        | Seeing NER examples pop up more frequently now, and wondering
        | why folks don't use spaCy for those sorts of tasks.
        
         | techwizrd wrote:
         | I use a fine-tuned BERT-like model for NER, but I'd be
         | interested to compare how it performs.
        
         | bugglebeetle wrote:
          | spaCy doesn't work well for multilingual training data, and
          | I've found it barfs in more and somehow even odder ways than
          | stuff in transformers.
        
         | binarymax wrote:
          | My line of thinking is to use the more expensive model to
          | label data, then use a teacher/student methodology to train
          | the smaller model (spaCy or BERT) for cost and speed.
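          | 
          | A sketch of the student side of that, packing teacher-labeled
          | spans into spaCy's training format (the example text, labels
          | and file names are illustrative):
          | 
          |   import spacy
          |   from spacy.tokens import DocBin
          |
          |   # (text, [(start, end, label), ...]) pairs that
          |   # came back from the expensive teacher model
          |   labeled = [
          |       ("Apple hired Jane Doe in Cupertino.",
          |        [(0, 5, "ORG"), (12, 20, "PERSON"),
          |         (24, 33, "GPE")]),
          |   ]
          |
          |   nlp = spacy.blank("en")
          |   db = DocBin()
          |   for text, spans in labeled:
          |       doc = nlp.make_doc(text)
          |       doc.ents = [doc.char_span(s, e, label=l)
          |                   for s, e, l in spans]
          |       db.add(doc)
          |   db.to_disk("train.spacy")
          |   # then: python -m spacy train config.cfg \
          |   #           --paths.train train.spacy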
        
       | [deleted]
        
       | rising-sky wrote:
       | > ~14 min. for 7B for 1 epoch on 3.5M tokens. ~26 min for 13B for
       | 1 epoch.
       | 
       | > At least 1xg5.16xlarge for head-node and 15xg5.4xlarge for
       | worker nodes for both 7B and 13B
       | 
       | For the uninitiated, anyone have an idea how much this would cost
       | on AWS?
        
         | grandpayeti wrote:
         | g5.16xlarge - $4.0960/hour
         | 
         | g5.4xlarge - $1.6240/hour
         | 
         | You're looking at about $30/hour to run this in us-east-1.
         | 
         | https://instances.vantage.sh/?selected=g5.16xlarge,g5.4xlarg...
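          | 
          | Back-of-the-envelope arithmetic from those prices:
          | 
          |   head = 1 * 4.0960      # g5.16xlarge
          |   workers = 15 * 1.6240  # g5.4xlarge
          |   print(round(head + workers, 2))  # 28.46 USD/hour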
        
           | rising-sky wrote:
           | thanks
        
       | yousif_123123 wrote:
        | It's weird that LoRA and training with quantization are not
        | being taken more seriously. It's way cheaper, takes less time,
        | and a lot of evidence shows it's pretty good.
        | 
        | I don't think it should be something brushed aside to be tried
        | out later.
        
         | perplexitywiz wrote:
         | https://twitter.com/Tim_Dettmers/status/1689375417189412864
        
           | DebtDeflation wrote:
           | I'm not sure to whom he is responding, since no one is
           | claiming LoRA performs as well as traditional fine tuning. If
           | you click through to the original Tweet he shared, it says
           | "when you have a lot of data and limited compute go for LoRA,
           | while with limited data and ample compute go for full
           | finetuning" which I think is absolutely correct and few would
           | disagree. As these models get bigger and bigger though, fewer
           | and fewer people are going to have the "ample compute"
           | required for full fine tuning.
        
             | scv119 wrote:
              | The tweet is referring to a paper that fine-tunes a
              | Chinese dataset on an English base model. I'm not
              | surprised by LoRA's poor results in this setup.
        
             | yousif_123123 wrote:
              | I'm not sure less data should require full fine-tuning. If
              | I had 5 pages of text, I don't see why I need to train
              | billions of parameters that are already trained pretty
              | well on general internet knowledge, and already know how
              | to chat.
              | 
              | From a practical perspective, unless cost is really
              | immaterial, I think most will end up starting with LoRA,
              | especially for 13B or 70B models; you could do 10 fine-
              | tuning runs for the cost of a few full fine-tunings.
              | 
              | But it's still all witchcraft to me to some degree, and
              | I'd probably try both full fine-tuning and LoRA.
        
       | bugglebeetle wrote:
       | Glad to see the NER-like task performed the best, as I was just
       | about to test something like this for comparison with a fine-
       | tuned BERT model. Any idea about the training costs for this
       | task?
        
         | binarymax wrote:
         | Great question. I wish they said how long the 10 epochs took,
         | so we could figure out the cost (or better, just posted the
         | time and cost together):
         | 
         |  _" For the 7B and 13B models, we used 16xA10Gs, and for the
         | 70B model, we used 32xA10Gs (across 4x g5.48xlarge instances).
         | When using Ray, there's no need to secure A100s to perform
         | full-parameter fine-tuning on these models! The process is
         | simply repeated for each task. Figures below show an example
         | run based on a context length of 512, with a total of 3.7M
         | effective tokens per epoch on GSM8k dataset.
         | 
         | We ran the training for a maximum of 10 epochs and selected the
         | best checkpoint according to the minimum perplexity score on
         | the validation set."_
        
           | kouroshh wrote:
           | Training times for GSM8k are mentioned here:
           | https://github.com/ray-
           | project/ray/tree/master/doc/source/te...
        
         | kouroshh wrote:
          | Hey, I am one of the co-authors of the post. The training
          | data for ViGGO has about 5.1k rows, which we trained with a
          | block size of 512 (you can lower the block size if you want,
          | but we didn't because it was just easier not to change the
          | code :)). On 16xA10Gs it took ~15 min per epoch for 7B and
          | ~25 min per epoch for 13B. So the on-demand cost per epoch is
          | ~$7.2 for 7B and ~$12 for 13B. This is based only on the time
          | spent on the training part and does not take into account
          | cluster startup and shutdown time.
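          | 
          | Working backwards from those figures (arithmetic only, no
          | extra pricing assumptions):
          | 
          |   # ~$7.2 per ~15-minute epoch implies roughly
          |   print(7.2 / (15 / 60))   # ~28.8 USD/hour (7B run)
          |   print(12.0 / (25 / 60))  # ~28.8 USD/hour (13B run)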
        
           | bugglebeetle wrote:
           | Great! Thank you!
        
       ___________________________________________________________________
       (page generated 2023-08-11 23:00 UTC)