[HN Gopher] Fine-Tuning Llama-2: A Comprehensive Case Study for ...
___________________________________________________________________
Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring
Custom Models
Author : robertnishihara
Score : 190 points
Date : 2023-08-11 16:34 UTC (6 hours ago)
(HTM) web link (www.anyscale.com)
(TXT) w3m dump (www.anyscale.com)
| praveenhm wrote:
| Is it possible to fine-tune Llama-2 locally on an M1 Ultra with 64GB?
| I would like to know, or any pointers would be good. Most guides are
| for the cloud or use Nvidia CUDA on Linux.
| ilaksh wrote:
| One challenge is that to get large enough custom datasets you
| either need a small army or a very strong existing model. Which
| means that you probably have to use OpenAI. And using OpenAI to
| generate training material for another model violates their
| terms.
|
| Has anyone taken them to court about this? Do we all just decide
| it's not fair and ignore it?
| bugglebeetle wrote:
| This is not true for all tasks. For many NLP tasks, you just
| need to reformat existing data to match the LLM format.
| sillysaurusx wrote:
| Why not ignore ToS? The worst that can happen is that you lose
| access.
| charcircuit wrote:
| The worst that can happen is you get brought into an
| expensive lawsuit.
| jawerty wrote:
| Just to add to this, I ran through a lot of these topics around
| fine-tuning Llama 2 on your own dataset (for me it's my own code
| :P) in a coding live stream a couple of weeks ago. All on a
| single Colab GPU.
|
| Fine-tuning Llama stream:
| https://www.youtube.com/watch?v=TYgtG2Th6fI&t=2282s
|
| I have a couple more streams where I do a QLoRA fine-tuning
| session and explain the concepts as a self-taught engineer
| (a software engineer of 8 years moving into ML recently).
|
| QLoRA fine-tuning stream:
| https://www.youtube.com/watch?v=LitybCiLhSc&t=4584s
|
| Overall I'm trying to break down how I'm approaching a lot of my
| personal projects and my current AI-driven startup. I want to make
| this information as accessible as possible. I also have a series
| where I'm fine-tuning a model to be the smallest webdev LLM
| possible, which people seem to like. I've only been streaming for
| about a month and there's plenty more to come.
|
| Ask me any questions about the streams and fine-tuning Llama!
| purplecats wrote:
| We really need a simple "put your source stuff in this directory,
| then press this button, then chat with your contents" type
| app/module/library.
|
| The amount of implementation detail required makes it
| inaccessible for any non-trivial use case. I imagine privateGPT
| will get there slowly.
| zora_goron wrote:
| I wrote a simple implementation to do this in ChatGPT via a
| local plugin [0]. Obviously it doesn't hit the "fully
| private" requirement but I imagine it would be relatively
| straightforward to integrate into a local LLM. The question
| is whether a local LLM would be as good at grabbing enough
| context and nuance from the project to answer meaningfully as
| GPT-4 is able to do with plugins.
|
| [0] https://github.com/samrawal/chatgpt-localfiles
| jawerty wrote:
| In one of my streams I essentially build this from scratch:
| https://www.youtube.com/watch?v=kBB1A2ot-Bw&t=236s. It's a
| retriever-reader model. Let me know if you want the code; I
| think I linked the Colab in the comments, but let me know if
| you need more.
| SubiculumCode wrote:
| One GPU? Feasible with one 3060?
| nacs wrote:
| Absolutely. For QLoRA / 4-bit / GPTQ fine-tuning, you can easily
| train a 7B on an RTX 3060 (12GB VRAM).
|
| If you have a 24GB VRAM GPU like an RTX 3090/4090, you can QLoRA
| fine-tune a 13B or even a 30B model (in a few hours).
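|
| A minimal sketch of what that looks like with the Hugging Face
| stack (transformers + peft + bitsandbytes); the model id, dataset
| file, and hyperparameters below are placeholders, not a vetted
| recipe:
|
|     # Hedged sketch: QLoRA-style 4-bit fine-tuning on one ~12GB GPU.
|     import torch
|     from datasets import load_dataset
|     from transformers import (AutoModelForCausalLM, AutoTokenizer,
|                               BitsAndBytesConfig, TrainingArguments,
|                               Trainer, DataCollatorForLanguageModeling)
|     from peft import (LoraConfig, get_peft_model,
|                       prepare_model_for_kbit_training)
|
|     base = "meta-llama/Llama-2-7b-hf"          # placeholder model id
|     tok = AutoTokenizer.from_pretrained(base)
|     tok.pad_token = tok.eos_token
|
|     # Load the base weights in 4-bit NF4, then attach LoRA adapters.
|     bnb = BitsAndBytesConfig(load_in_4bit=True,
|                              bnb_4bit_quant_type="nf4",
|                              bnb_4bit_compute_dtype=torch.bfloat16)
|     model = AutoModelForCausalLM.from_pretrained(
|         base, quantization_config=bnb, device_map="auto")
|     model = prepare_model_for_kbit_training(model)
|     model = get_peft_model(model, LoraConfig(
|         r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
|         task_type="CAUSAL_LM"))
|
|     # Your own data: one JSON object per line with a "text" field.
|     ds = load_dataset("json", data_files="train.jsonl")["train"]
|     ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=512),
|                 remove_columns=ds.column_names)
|
|     Trainer(model=model, train_dataset=ds,
|             args=TrainingArguments("out", per_device_train_batch_size=1,
|                                    gradient_accumulation_steps=16,
|                                    num_train_epochs=1, bf16=True),
|             data_collator=DataCollatorForLanguageModeling(tok, mlm=False)
|             ).train()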
| jawerty wrote:
| +1 this
| kouroshh wrote:
| It would be good to see a rigorous analysis of the quality of
| these PEFT methods. There still seems to be debate about
| whether these methods sacrifice quality or not.
| sandGorgon wrote:
| This is brilliant. Could you do a series about how to prepare
| custom datasets for fine-tuning? That's the part that a lot of
| other tutorials skip, especially for different goals like
| safety, accuracy, etc.
| jawerty wrote:
| Of course. I have a few streams where I web-scrape and build a
| dataset for myself with prefix tokens. I can break that down
| more in a dedicated stream.
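|
| For a rough idea of the shape of such a dataset (the field names,
| prompt template, and "<webdev>" prefix token below are purely
| illustrative, not what the streams actually use):
|
|     # Hedged sketch: turn scraped (instruction, response) pairs into
|     # a JSONL fine-tuning file, prepending a task-specific prefix
|     # token to each example.
|     import json
|
|     scraped = [
|         {"instruction": "Center a div with flexbox",
|          "response": ".parent { display: flex;"
|                      " justify-content: center; }"},
|     ]
|
|     with open("train.jsonl", "w") as f:
|         for row in scraped:
|             text = ("<webdev> ### Instruction:\n" + row["instruction"] +
|                     "\n### Response:\n" + row["response"])
|             f.write(json.dumps({"text": text}) + "\n")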
| SOLAR_FIELDS wrote:
| What is the general thought process on when it makes sense to
| use RAG vs fine tuning?
|
| How does segmenting fine-tuned models make sense? Do I need a
| Terraform LLM, a SQL LLM, and a Python LLM, or can I just use a
| "code" LLM?
| DebtDeflation wrote:
| Fine-tuning is for training the model to perform a new task; RAG
| is for adding knowledge.
|
| In your example, you would fine-tune the model to teach it to
| code in a language it hasn't seen before; RAG will not really
| help with that.
| jawerty wrote:
| I have a RAG video (my "make a ChatGPT with podcasts" video)
| you might be interested in. Semantic search is incredible,
| and you might be surprised how good a Q/A solution can be just
| by extracting passages that answer the question.
|
| Overall it depends on whether you can turn your data into a
| fine-tuning dataset, and whether you can find a small-enough
| model that can use your retrieved contexts as input, either to
| host yourself or to use via inference endpoints. Hosting an LLM
| is actually not easy, and from working at an information-
| retrieval business I'm finding OpenAI isn't terrible compared
| to the cost of having GPUs for your users across the world.
| zby wrote:
| There is an article at the original site about that:
| https://www.anyscale.com/blog/fine-tuning-is-for-form-not-
| fa...
|
| Everybody new to this field thinks they need fine-tuning
| to teach the LLM new facts. I made the same mistake
| initially; later I published a slightly ranty post on that:
| https://zzbbyy.substack.com/p/why-you-need-rag-not-
| finetunin...
| sandGorgon wrote:
| Quick question: the Gorilla paper talks about fine-tuning for
| RAG. Do you see this in practice? Can you do fine-tuning
| that specifically affects RAG?
| 0xDEF wrote:
| Has anyone had luck with fine-tuning Llama-v2-7b using the paid
| (EUR11.00) Colab Pro?
| richardliaw wrote:
| I'm really glad to see a post like this come out. I've seen so
| many discussions online about customizing models -- this post
| really does cut through the noise.
|
| Really like the evaluation methodology, and it's well written
| too.
| behnamoh wrote:
| > Additionally, while this wasn't an issue for GPT, the Llama
| chat models would often output hundreds of miscellaneous tokens
| that were unnecessary for the task, further slowing down their
| inference time (e.g. "Sure! Happy to help...").
|
| That's the problem I've been facing with Llama 2 as well. It's
| almost impossible to have it just output the desired text. It
| will always add something before and after its response. Does
| anyone know if there's any prompt technique to fix this problem?
| redox99 wrote:
| Use a better model.
|
| airoboros supports the PLAINFORMAT token "to avoid backticks,
| explanations, etc. and just print the code".
|
| https://huggingface.co/TheBloke/airoboros-l2-70B-GPT4-2.0-GG...
| mikeravkine wrote:
| The model card also has prompt formats for context-aware
| document Q/A and multi-CoT; using those correctly improves
| performance on such tasks significantly.
| crooked-v wrote:
| It's not useful for code, but you can see the difference in
| approach with NovelAI's homegrown Kayra model, which is set
| up to handle a mix of text completion and instruct
| functionality. It never includes extraneous prefix/suffix
| text and will smoothly follow instructions embedded in text
| without interrupting the text.
| behnamoh wrote:
| Thanks, I'll give this a try.
|
| I wonder if LLMs will have less reasoning power if they
| simply return the output. AFAIK, they think by writing their
| thoughts. So forcing an LLM to just return the goddamn code
| might limit its reasoning skills, leading to poor code. Is
| that true?
| dontupvoteme wrote:
| You can also just parse the text for all valid code blocks
| and combine them. I have a script which automatically checks
| the clipboard for this.
|
| There's no reason to handle the LLM side of things, unless
| you want to try and optimize the amount of tokens which are
| code vs comments vs explanations and such. (Though you
| could also just start a new context window with only your
| code or such)
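|
| A rough sketch of that kind of post-processing (generic, not the
| poster's actual clipboard script):
|
|     # Hedged sketch: keep only the fenced code blocks from an LLM
|     # reply and drop the surrounding chatter.
|     import re
|
|     def extract_code(reply: str) -> str:
|         blocks = re.findall(r"```[\w+-]*\n(.*?)```", reply, re.DOTALL)
|         return "\n".join(b.strip() for b in blocks)
|
|     reply = ("Sure! Happy to help...\n"
|              "```python\nprint('hi')\n```\nHope that helps!")
|     print(extract_code(reply))   # -> print('hi')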
| redox99 wrote:
| Potentially it could have an impact if it omits a high
| level description before writing the code, although
| obviously things like "Sure! Happy to help" do not help.
|
| In practice I haven't seen it make too much of a difference
| _with GPT_. The model can still use comments to express
| itself.
|
| For non coding tasks, adding "Think step by step" makes a
| huge difference (versus YOLOing a single word reply).
| behnamoh wrote:
| > although obviously things like "Sure! Happy to help" do
| not help.
|
| Yes you're right. I'm mostly concerned with the text that
| actually "computes" something before the actual code
| begins. Niceties like "sure! happy to help" don't compute
| anything.
|
| CoT indeed works. Now I've seen people take it to the
| extreme by having tree of thoughts, forest of thoughts,
| etc. but I'm not sure how much "reasoning" we can extract
| from a model that is obviously limited in terms of
| knowledge and intelligence. CoT already gets us to 80% of
| the way. With some tweaks it can get even better.
|
| I've also seen simulation methods where GPT "agents" talk
| to each other to form better ideas about a subject. But
| then again, it's like trying to achieve _perpetual
| motion_ in physics. One can't get more intelligence from
| a system than one puts in the system.
| kaibee wrote:
| > But then again, it's like trying to achieve perpetual
| motion in physics. One can't get more intelligence from a
| system than one puts in the system.
|
| Not necessarily the same thing, as you're still putting
| in more processing power/checking more possible paths.
| It's kinda like simulated annealing: sure, the system is
| dumb, but as long as checking whether you have a correct
| answer is cheap, it still narrows down the search space a
| lot.
| behnamoh wrote:
| > It's kinda like simulated annealing.
|
| Yeah I get that. We assume there's X amount of
| intelligence in the LLM and try different paths to tap into
| that potential. The more paths are simulated, the closer
| we get to the LLM's intelligence asymptote. But then
| that's it--we can't go any further.
| kouroshh wrote:
| Llama-2-chat models have been overly fine-tuned to be like
| this. You can give few-shot prompting a try, but it still
| doesn't guarantee the desired output. The best way to guarantee
| it is to fine-tune on a small (~1k) set of data points and go
| from there.
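|
| For reference, a few-shot prompt in the Llama-2-chat template
| looks roughly like the sketch below (the exact whitespace/BOS
| handling varies between implementations, and the extraction task
| is made up):
|
|     # Hedged sketch: build a few-shot Llama-2-chat prompt that asks
|     # for bare output only.
|     system = "Return only the requested output, with no preamble."
|     shots = [("Extract the city: 'I flew to Paris on Monday.'", "Paris"),
|              ("Extract the city: 'We met in Tokyo last year.'", "Tokyo")]
|     query = "Extract the city: 'She moved to Oslo in 2019.'"
|
|     prompt = ("<s>[INST] <<SYS>>\n" + system + "\n<</SYS>>\n\n" +
|               shots[0][0] + " [/INST] " + shots[0][1] + " </s>")
|     for q, a in shots[1:]:
|         prompt += "<s>[INST] " + q + " [/INST] " + a + " </s>"
|     prompt += "<s>[INST] " + query + " [/INST]"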
| spdustin wrote:
| Seeing NER examples pop up more frequently now, and wondering why
| folks don't use spacy for those sorts of tasks.
| techwizrd wrote:
| I use a fine-tuned BERT-like model for NER, but I'd be
| interested to compare how it performs.
| bugglebeetle wrote:
| Spacy doesn't work well for multilingual training data and I've
| found it barfs in more and somehow even odder ways than stuff
| in transformers.
| binarymax wrote:
| My line of thinking is to use the more expensive model to label
| data, then use a teacher/student methodology to train a smaller
| model (spaCy or BERT) for cost and speed.
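|
| Roughly, the labeling half of that pipeline could look like the
| sketch below (the teacher call is stubbed out and the entity
| schema is made up; the student training step is omitted):
|
|     # Hedged sketch: have a strong "teacher" LLM emit entity spans,
|     # then convert them to (start, end, label) character offsets for
|     # training a smaller "student" NER model.
|     def teacher_label(text: str) -> list[dict]:
|         # Stand-in for a call to the expensive model, prompted to
|         # return JSON like [{"span": "Anyscale", "label": "ORG"}].
|         return [{"span": "Anyscale", "label": "ORG"}]
|
|     def to_training_example(text: str) -> tuple[str, dict]:
|         ents = []
|         for e in teacher_label(text):
|             start = text.find(e["span"])
|             if start != -1:
|                 ents.append((start, start + len(e["span"]), e["label"]))
|         return text, {"entities": ents}
|
|     print(to_training_example("Anyscale wrote the fine-tuning post."))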
| [deleted]
| rising-sky wrote:
| > ~14 min. for 7B for 1 epoch on 3.5M tokens. ~26 min for 13B for
| 1 epoch.
|
| > At least 1xg5.16xlarge for head-node and 15xg5.4xlarge for
| worker nodes for both 7B and 13B
|
| For the uninitiated, anyone have an idea how much this would cost
| on AWS?
| grandpayeti wrote:
| g5.16xlarge - $4.0960/hour
|
| g5.4xlarge - $1.6240/hour
|
| You're looking at about $30/hour to run this in us-east-1.
|
| https://instances.vantage.sh/?selected=g5.16xlarge,g5.4xlarg...
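|
| (That is, roughly 1 x $4.096/hr + 15 x $1.624/hr, or about
| $28.5/hr of on-demand instance time, before any cluster startup
| or shutdown.)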
| rising-sky wrote:
| thanks
| yousif_123123 wrote:
| It's weird that LoRA and training with quantization aren't being
| taken more seriously. They're way cheaper, take less time, and a
| lot of evidence shows they're pretty good.
|
| I don't think they should be brushed aside as something to try
| out later.
| perplexitywiz wrote:
| https://twitter.com/Tim_Dettmers/status/1689375417189412864
| DebtDeflation wrote:
| I'm not sure to whom he is responding, since no one is
| claiming LoRA performs as well as traditional fine tuning. If
| you click through to the original Tweet he shared, it says
| "when you have a lot of data and limited compute go for LoRA,
| while with limited data and ample compute go for full
| finetuning" which I think is absolutely correct and few would
| disagree. As these models get bigger and bigger though, fewer
| and fewer people are going to have the "ample compute"
| required for full fine tuning.
| scv119 wrote:
| The tweet is referring to a paper that fine-tunes a Chinese
| dataset on an English base model. I'm not surprised by
| LoRA's poor result in this setup.
| yousif_123123 wrote:
| I'm not sure less data should require full fine-tuning. If
| I had 5 pages of text, I don't see why I need to train
| billions of parameters that are already trained pretty well
| on general internet knowledge and already know how to chat.
|
| From a practical perspective, unless cost is really
| immaterial, I think most will end up starting with LoRA,
| especially for 13B or 70B models: you could do 10
| fine-tuning runs for the cost of a few full fine-tunings.
|
| But it's still all witchcraft to me to some degree, and I'd
| probably try both full fine-tuning and LoRA.
| bugglebeetle wrote:
| Glad to see the NER-like task performed the best, as I was just
| about to test something like this for comparison with a fine-
| tuned BERT model. Any idea about the training costs for this
| task?
| binarymax wrote:
| Great question. I wish they said how long the 10 epochs took,
| so we could figure out the cost (or better, just posted the
| time and cost together):
|
| _" For the 7B and 13B models, we used 16xA10Gs, and for the
| 70B model, we used 32xA10Gs (across 4x g5.48xlarge instances).
| When using Ray, there's no need to secure A100s to perform
| full-parameter fine-tuning on these models! The process is
| simply repeated for each task. Figures below show an example
| run based on a context length of 512, with a total of 3.7M
| effective tokens per epoch on GSM8k dataset.
|
| We ran the training for a maximum of 10 epochs and selected the
| best checkpoint according to the minimum perplexity score on
| the validation set."_
| kouroshh wrote:
| Training times for GSM8k are mentioned here:
| https://github.com/ray-
| project/ray/tree/master/doc/source/te...
| kouroshh wrote:
| Hey, I am one of the co-authors of the post. So the training
| data for ViGGO has about 5.1k rows, which we trained with a
| block size of 512 (you can lower the block size if you want but
| we didn't do so because it was just easier to not change code
| :)). On 16xA10Gs for 7B it took ~15 min per epoch and on 13B it
| took ~25 min per epoch. So the on-demand cost per epoch is
| ~$7.2 for 7B and ~$12 for 13B. This is based only on the time
| spent on the training part and does not take into account
| cluster startup and shutdown time.
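|
| (For reference, that is consistent with the on-demand prices
| quoted upthread: assuming the 1x g5.16xlarge + 15x g5.4xlarge
| layout, the 16xA10G cluster is ~$28.5/hr, so ~15 min/epoch comes
| to ~$7.1 and ~25 min/epoch to ~$11.9.)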
| bugglebeetle wrote:
| Great! Thank you!
___________________________________________________________________
(page generated 2023-08-11 23:00 UTC)