[HN Gopher] Ask HN: Most efficient way to fine-tune an LLM in 2024?
___________________________________________________________________
Ask HN: Most efficient way to fine-tune an LLM in 2024?
In Apr 2024 what is the most efficient way to fine-tune an LLM? In
particular we are trying to understand performance vs. cost trade-
offs. We don't have a budget to train from scratch. We are working
with a proprietary data set on the order of 100M tokens and are
looking to fine-tune a general purpose language model and also
create task-specific models based on the same corpus. Any help
would be appreciated!
Author : holomorphiclabs
Score : 81 points
Date : 2024-04-04 19:01 UTC (4 hours ago)
| alxgt wrote:
| Interested
| jasonjmcghee wrote:
| The approach I see used is axolotl with QLoRA using cloud GPUs
| which can be quite cheap.
|
| https://github.com/OpenAccess-AI-Collective/axolotl
|
| Someone from one of the cloud GPU vendors wrote a guide:
| https://brev.dev/blog/fine-tuning-mistral
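|
| For a sense of what such a QLoRA config boils down to under the
| hood, here is a minimal sketch in plain transformers + peft +
| bitsandbytes (axolotl itself is driven by a YAML config; the model
| name and hyperparameters below are illustrative placeholders, not
| anything taken from the guide):
|
|     # Minimal QLoRA setup sketch: 4-bit base model + LoRA adapter.
|     import torch
|     from transformers import (AutoModelForCausalLM, AutoTokenizer,
|                               BitsAndBytesConfig)
|     from peft import (LoraConfig, get_peft_model,
|                       prepare_model_for_kbit_training)
|
|     base = "mistralai/Mistral-7B-v0.1"
|     bnb = BitsAndBytesConfig(
|         load_in_4bit=True,                # NF4-quantized base weights
|         bnb_4bit_quant_type="nf4",
|         bnb_4bit_use_double_quant=True,
|         bnb_4bit_compute_dtype=torch.bfloat16,
|     )
|     tokenizer = AutoTokenizer.from_pretrained(base)
|     model = AutoModelForCausalLM.from_pretrained(
|         base, quantization_config=bnb, device_map="auto")
|     model = prepare_model_for_kbit_training(model)
|
|     lora = LoraConfig(
|         r=16, lora_alpha=32, lora_dropout=0.05,
|         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
|         task_type="CAUSAL_LM",
|     )
|     model = get_peft_model(model, lora)   # only adapters train
|     model.print_trainable_parameters()
|     # Train with a standard Trainer/SFT loop on your formatted data.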
| dosshell wrote:
| I know this is maybe not the answer you want, but if you are just
| interested in getting the job done there exist companies that are
| experts on this, for example:
|
| https://fortune.com/2024/03/11/adaptive-startup-funding-falc...
| uptownfunk wrote:
| Also interested in this. Does this task really require such
| specialized knowledge?
| ilaksh wrote:
| The first thing that is required is to define what they are
| trying to do. In other words, list some question and answer
| examples. It's amazing how many people are unwilling or
| unable to do this and just jump to "we need to train a custom
| model". To do what exactly, or answer what kinds of
| questions? I have actually had multiple clients refuse to do
| that.
| dvt wrote:
| I think you may be misunderstanding what fine tuning does. It
| does _not_ teach the model new knowledge. In fact, Meta has a
| paper out that argues you only need a data set of about 1,000
| examples[1] to achieve pretty good alignment (fine-tuning)
| results. (100M is way
| overkill.) For knowledge retrieval, you need RAG (usually using
| the context window).
|
| [1] https://arxiv.org/pdf/2305.11206.pdf
| ramoz wrote:
| Depending on the application, you would do continued
| pretraining over new tokens to gain new knowledge. 100M tokens
| is applicable here.
|
| You would fine-tune, certainly, for domain-specific tasks, and
| would curate a subset of the 100M tokens. (The alignment study
| referenced totals roughly 1,000,000 tokens.)
|
| RAG is a hacky way to interpolate new knowledge with a base
| model. It is not always reliable, nor is it easy to integrate
| into task-specific workflows.
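|
| For the continued-pretraining path, the preprocessing is mostly
| tokenizing the raw corpus and packing it into fixed-length blocks
| for causal-LM training. A rough sketch (file name, block size and
| tokenizer are placeholders):
|
|     # Pack a raw domain corpus into fixed-length blocks for
|     # continued (causal-LM) pretraining.
|     from datasets import load_dataset
|     from transformers import AutoTokenizer
|
|     tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
|     raw = load_dataset("json", data_files="corpus.jsonl", split="train")
|     block_size = 2048
|
|     def tokenize(batch):
|         return tokenizer(batch["text"])
|
|     def group_texts(examples):
|         # Concatenate everything, then split into block_size chunks.
|         ids = sum(examples["input_ids"], [])
|         total = (len(ids) // block_size) * block_size
|         chunks = [ids[i:i + block_size]
|                   for i in range(0, total, block_size)]
|         return {"input_ids": chunks, "labels": [c[:] for c in chunks]}
|
|     tokenized = raw.map(tokenize, batched=True,
|                         remove_columns=raw.column_names)
|     lm_data = tokenized.map(group_texts, batched=True,
|                             remove_columns=tokenized.column_names)
|     # ~100M tokens at block_size 2048 is on the order of 50k blocks;
|     # train with a standard causal-LM Trainer from here.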
| Xenoamorphous wrote:
| When I first played with RAG I thought "wow this is so cool".
| Now I'm starting to think it's kinda useless, in the sense
| that the critical bit is the initial search, and that doesn't
| use the LLM's power; at most the LLM is used to capture the
| user intent and reformulate the query.
|
| We're building some "smart search" functionality for some
| teams and I start to wonder if a traditional search results
| list (i.e. sans the LLM, or used only to rewrite the user
| query) with the document chunks wouldn't be better than
| blindly taking the top N and feeding them to the LLM to
| produce some response.
|
| E.g. we have some docs about specific supermarket chains where
| the word "supermarket" might not appear at all, but
| the user query might be "show me what we have about
| supermarkets". Now the embeddings hopefully will place the
| word "supermarket" close to, say, "Costco", but they might
| also place it closer to "shopping center", and we might have
| docs about shopping centers that could rank higher. So we
| might take the top 5 docs and send them to the LLM, but the
| docs the user was after might have been in 7th and 9th
| position, nowhere to be seen by the LLM nor the user.
| dvt wrote:
| > We're building some "smart search" functionality for some
| teams and I start to wonder if a traditional search results
| list (i.e. sans the LLM, or used only to rewrite the user
| query) with the document chunks wouldn't be better than
| blindly taking the top N and feeding them to the LLM to
| produce some response.
|
| Yep, it's a pretty common pattern: query -> embeddings ->
| vector db -> records -> context -> LLM -> result.
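|
| A minimal end-to-end sketch of that pattern with
| sentence-transformers + FAISS (`call_llm` is a placeholder for
| whatever model or API you actually use, and the docs are dummies):
|
|     # query -> embeddings -> vector db -> records -> context -> LLM
|     import faiss
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     docs = ["doc one ...", "doc two ..."]          # your chunks
|     embedder = SentenceTransformer("all-MiniLM-L6-v2")
|     vecs = embedder.encode(docs, normalize_embeddings=True)
|     index = faiss.IndexFlatIP(vecs.shape[1])       # cosine via IP
|     index.add(np.asarray(vecs, dtype="float32"))
|
|     def retrieve(query, k=5):
|         q = embedder.encode([query], normalize_embeddings=True)
|         _, idx = index.search(np.asarray(q, dtype="float32"), k)
|         return [docs[i] for i in idx[0] if i != -1]
|
|     def answer(query):
|         context = "\n\n".join(retrieve(query))
|         prompt = (f"Answer using only this context:\n{context}"
|                   f"\n\nQuestion: {query}")
|         return call_llm(prompt)                    # placeholder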
| Xenoamorphous wrote:
| Yes that's basically the RAG pattern, but I've edited my
| comment to elaborate a bit. I'm questioning what the LLM
| brings to the table vs just showing the search results (a
| long list not limited by context length) to the user.
|
| The LLM doesn't even get the full docs most of the time,
| just chunks. It has a very narrow view so its full power
| is not used.
| ec109685 wrote:
| Another approach is to take the user query, have the LLM
| guess the answer and use that guessed answer for the RAG
| step.
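|
| That trick is essentially HyDE (hypothetical document embeddings).
| A tiny sketch, reusing whatever LLM call and vector search you
| already have (`call_llm` and `retrieve` are placeholders):
|
|     def hyde_retrieve(query, k=5):
|         # Embed/search on a guessed answer instead of the raw query;
|         # the guess usually shares more vocabulary with the target docs.
|         guess = call_llm(f"Write a short plausible answer to: {query}")
|         return retrieve(guess, k=k)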
| ramoz wrote:
| I've worked in scaled enterprise search, both with lexical
| (Lucene-based, e.g. Elasticsearch) & semantic search engines
| (vector retrieval).
|
| Vector retrieval that isn't contextualized in the domain is
| usually bad (RAG solutions call this "naive rag" ... and
| make up for it with funky chunking and retrieval
| ensembles). Training custom retrievers and rerankers is often
| key, but it's quite an effort and still hard to generalize
| in a domain with broad knowledge.
|
| Lexical based searching provides nice guarantees and
| deterministic control in results (depending on how you
| index). Certainly useful here is advanced querying
| capability. Constructing/enriching queries with
| transformers is cool.
|
| Reranking is often a nice ensemble addition, and it can be
| done with smaller models.
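|
| For the reranking piece, a small cross-encoder over the
| first-stage candidates is often enough. A sketch (the checkpoint
| is just a common public one):
|
|     # Rerank candidates from any first-stage retriever (lexical or
|     # vector) with a small cross-encoder.
|     from sentence_transformers import CrossEncoder
|
|     reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
|
|     def rerank(query, candidates, top_n=5):
|         scores = reranker.predict([(query, doc) for doc in candidates])
|         ranked = sorted(zip(candidates, scores),
|                         key=lambda p: p[1], reverse=True)
|         return [doc for doc, _ in ranked[:top_n]]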
| holomorphiclabs wrote:
| Our findings are that RAG does not generalize well when
| critical understanding is shared over a large corpus of
| information. We do not think it is a question of either context
| length or retrieval. In our case the goal is very clearly to
| capture that understanding within the model itself.
| ilaksh wrote:
| Does that mean you tested on specific questions? Get 1-5
| typical queries and test them with a properly configured
| llamaindex.
|
| If your documents repeat the same information several
| different ways then you actually might get something out of
| LoRA on raw documents. But you need a way to measure it and
| you have to verify that RAG won't work with real tests first.
|
| To do effective training with LoRA, though, and expect it to
| pick up most of the information reliably, you need to cover
| the knowledge and skills with multiple question-answer pairs
| for each item you expect it to learn. You can then use those
| QA pairs to validate that it learned those things.
|
| But it's a lot of QA pair generation.
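|
| A toy version of that validation step, to make it concrete
| (`model_generate` wraps your tuned model; the embedding-similarity
| scoring is a crude stand-in for a real eval):
|
|     from sentence_transformers import SentenceTransformer, util
|
|     scorer = SentenceTransformer("all-MiniLM-L6-v2")
|
|     def evaluate(model_generate, qa_pairs, threshold=0.7):
|         # qa_pairs: held-out (question, reference_answer) tuples
|         hits = 0
|         for question, reference in qa_pairs:
|             answer = model_generate(question)
|             sim = util.cos_sim(
|                 scorer.encode(answer, convert_to_tensor=True),
|                 scorer.encode(reference, convert_to_tensor=True),
|             ).item()
|             hits += sim >= threshold
|         return hits / len(qa_pairs)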
| viksit wrote:
| question: RAG by definition offloads the retrieval to a vector
| similarity search via embeddings db (faiss, knn et al).
|
| what is the preferred way to feed documents / knowledge into a
| model so that the primary retrieval is done by the llm, and
| perhaps use vector db only for information enhancement (a la
| onebox)?
| ozr wrote:
| This is not correct. Fine-tuning can absolutely add new
| knowledge to a model. It's been repeatedly demonstrated at this
| point.
|
| LIMA demonstrated that instruction-tuning and output formatting
| could be trained with a limited number of samples, _not_ that
| finetuning was incapable of adding new information to the
| model.
|
| It may be sub-optimal compared to RAG in most cases, but it does work.
| simonw wrote:
| Do you have any good links to support the idea that this has
| been repeatedly demonstrated?
|
| I've had trouble finding high quality sources of information
| about successful applications of fine-tuning to add knowledge
| to a model.
| 2024throwaway wrote:
| Here is a recent HN discussion of an article that talks
| about this. https://news.ycombinator.com/item?id=39748537
|
| Anecdotally, I literally "added knowledge" to a model via
| fine-tuning earlier today.
|
| Fine-tuning can do extremely well given a specific question
| and answer: the tuned model "knows" how to answer that
| question much more accurately.
|
| I gave it a specific question, and a good answer as a fine
| tuning input. (Literally 2 data points as the input, 2
| question/answer sets.)
|
| I asked it that question, and the tuned model blows the
| base model away, for answering that specific question.
| tdba wrote:
| What's your measure of performance?
|
| There's no one-size-fits-all answer yet, but if you just want to
| test it out there are many commercial offerings on which you
| should be able to get some results for under $10k.
| holomorphiclabs wrote:
| Are there any that are recommended? Honestly we would rather
| not share data with any 3P vendors. It's been a painstaking
| process to curate it.
| FezzikTheGiant wrote:
| I was just gonna ask this question and saw this at the top of
| Ask. Interested.
| Redster wrote:
| What LLM are you hoping to use? Have you considered using
| HelixML? If I am reading you right, the primary concern is
| compute costs, not human-time costs?
| Redster wrote:
| That said, I think dvt's comment about RAG likely being what
| you need rather than fine-tuning is helpful, but I wanted to
| offer something in case you know fine-tuning is what you need.
| holomorphiclabs wrote:
| We are finding there is a trade-off between model performance
| and hosting costs post-training. The optimal outcome is a model
| that performs well on next-token prediction (and some other
| in-house tasks we've defined) and that we can host on the
| lowest-cost hosting provider rather than be locked in. I think
| we'd only go the
| proprietary model route if the model really was that much
| better. We're just trying to save ourselves weeks/months of
| benchmarking time/costs if there was already an established
| option in this space.
| blissfulresup wrote:
| Look into LoRA
|
| https://arxiv.org/abs/2106.09685
| holomorphiclabs wrote:
| Thank you, we have been exploring this.
| dhouston wrote:
| Qlora + axolotl + good foundation model (llama/mistral/etc,
| usually instruction fine tuned) + runpod works great.
|
| A single A100 or H100 with 80GB VRAM can fine tune 70B open
| models (and obviously scaling out to many nodes/GPUs is faster,
| or can use much cheaper GPUs for fine tuning smaller models.)
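|
| Rough back-of-envelope on why a 70B QLoRA run fits in 80GB (the
| QLoRA paper's headline result was a 65B finetune on a single 48GB
| GPU; the adapter size below is an assumed round number):
|
|     params = 70e9
|     weights_gb = params * 0.5 / 1e9        # 4-bit NF4 base: ~35 GB
|     lora_params = 2e8                      # depends on rank/targets
|     # bf16 weights + grads (2+2 bytes) and fp32 Adam moments (4+4)
|     adapter_gb = lora_params * 12 / 1e9    # ~2.4 GB
|     # Remaining VRAM goes to activations (with grad checkpointing),
|     # quantization constants and the paged optimizer.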
|
| The localllama Reddit sub at https://www.reddit.com/r/LocalLLaMA/
| is also an awesome community for the GPU poor :)
| holomorphiclabs wrote:
| Thank you! and yes huge fan of r/localllama :)
| gardnr wrote:
| You probably want to build a retrieval-augmented generation
| pipeline.
|
| If you do end up wanting to fine tune then use qlora with axolotl
| or unsloth to prove your hypothesis on a smaller model and then
| evaluate if you want the marginal gains you get from full
| precision training.
|
| After you fine-tune it with the 100M-token dataset, use DPO to polish
| it off. You need to create a DPO dataset for that but it can be
| relatively small to get some great gains.
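|
| The DPO data is just prompt / chosen / rejected triples, so the
| dataset can be assembled by hand or semi-automatically. A minimal
| sketch of the shape (contents are dummy placeholders):
|
|     from datasets import Dataset
|
|     pairs = [
|         {
|             "prompt": "A task prompt drawn from your own workload.",
|             "chosen": "The answer you prefer the model to give.",
|             "rejected": "A weaker or off-style answer to push away from.",
|         },
|         # ... a few hundred to a few thousand of these is often plenty
|     ]
|     dpo_dataset = Dataset.from_list(pairs)
|     # Feed this to a DPO trainer (e.g. trl's DPOTrainer), starting
|     # from the SFT model with a frozen copy as the reference model.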
|
| After that, look at applying grammars during inference if you are
| expecting structured results like json.
|
| You should be able to run the experiments on 4090s from vast.ai
| or runpod or similar service.
|
| It can cost less than $100 depending on your requirements.
| objektif wrote:
| Do you have any tutorials to achieve all this? Thanks.
| HarHarVeryFunny wrote:
| A possible alternative to fine-tuning is in-context learning,
| especially if you are using a model with long context where you
| can provide a lot of examples. Models can do one/few-shot
| learning, but in-context learning improves the more examples you
| give. You could experiment cheaply with Claude Haiku to see if
| this works for you.
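|
| A sketch of what that looks like in practice (model name and API
| shape per Anthropic's Messages API as of early 2024; `examples` is
| your own list of (input, output) pairs):
|
|     import anthropic
|
|     client = anthropic.Anthropic()   # uses ANTHROPIC_API_KEY
|
|     def many_shot(examples, query):
|         shots = "\n\n".join(f"Input: {x}\nOutput: {y}"
|                             for x, y in examples)
|         prompt = f"{shots}\n\nInput: {query}\nOutput:"
|         resp = client.messages.create(
|             model="claude-3-haiku-20240307",
|             max_tokens=512,
|             messages=[{"role": "user", "content": prompt}],
|         )
|         return resp.content[0].text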
| xianshou wrote:
| Single-GPU, optimal efficiency: unsloth + qlora + mistral-7b on
| runpod/vast/lambda
|
| Blazing fast compared to out-of-the-box transformers; also make
| sure to use flash attention if you have A100s or better and
| context length >= 2k
|
| Add FAISS (https://github.com/facebookresearch/faiss) if you need
| fast local RAG
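|
| If you go the plain-transformers route instead of unsloth, flash
| attention is just a load-time flag these days (needs the
| flash-attn package and an Ampere-or-newer GPU; the model name is
| illustrative):
|
|     import torch
|     from transformers import AutoModelForCausalLM
|
|     model = AutoModelForCausalLM.from_pretrained(
|         "mistralai/Mistral-7B-v0.1",
|         torch_dtype=torch.bfloat16,
|         attn_implementation="flash_attention_2",
|         device_map="auto",
|     )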
| stanbiryukov wrote:
| I recommend reviewing Stanford's dspy library - great examples of
| few-shot learning that works by generating and tuning prompts for
| LLMs and even distilling instruction following tasks to smaller
| models like T5. Second, as others mentioned, use QLoRA for
| supervised fine-tuning followed by DPO/KTO for preference
| optimization. This strategy placed Hugging Face's Zephyr and
| Intel's Neural Chat on leaderboards for 7B-parameter models. I also
| recommend reviewing the Unsloth library which has excellent
| accelerated examples of using these methods, along with the
| axolotl library. Lastly, skypilot and Modal both have excellent
| examples that showcase using axolotl to efficiently finetune
| models on cloud GPUs. [1] https://github.com/stanfordnlp/dspy [2]
| https://github.com/unslothai/unsloth [3]
| https://github.com/OpenAccess-AI-Collective/axolotl [4]
| https://github.com/skypilot-org/skypilot [5]
| https://github.com/modal-labs/llm-finetuning
| viksit wrote:
| i looked at dspy last week, and was trying to wrap my head
| around how it would be useful for a "fine tune" style use case
| - where i would want to give the base model more context vs use
| a vector DB and have the model put together a result.
|
| could you give a high level way to think about how to use dspy
| for something like this?
| stanbiryukov wrote:
| I think of dspy as a programmatic way to guide LLMs with
| information, whether from context based on retrieval or from
| input and output pairs, rather than traditional low-rank
| fine-tuning. Their readme has a high-level introduction to
| using RAG with a user defined way to pass relevant context. I
| also found their link to Weaviate's notebooks, where dspy is
| used with a vector DB, helpful in understanding an end-to-end
| workflow: [1]
| https://github.com/weaviate/recipes/tree/main/integrations/d...
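|
| For a concrete picture, the dspy intro's RAG program is only a few
| lines; roughly (the retriever and LM are whatever you configure in
| dspy.settings):
|
|     import dspy
|
|     class RAG(dspy.Module):
|         def __init__(self, k=3):
|             super().__init__()
|             self.retrieve = dspy.Retrieve(k=k)
|             self.generate = dspy.ChainOfThought(
|                 "context, question -> answer")
|
|         def forward(self, question):
|             context = self.retrieve(question).passages
|             return self.generate(context=context, question=question)
|
|     # dspy's optimizers ("teleprompters") then tune the prompts and
|     # few-shot demos of this program against a small metric/trainset,
|     # or distill it into a smaller model.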
| magdyks wrote:
| Fine-tuning a LoRA-based adapter using a tool like predibase.com
| does this really fast. If you wanna go fully open source and have
| your own hardware, you can do the same thing yourself with a
| Ludwig + LoRAX stack.
| viksit wrote:
| if i understand the problem correctly - you'd like to feed xMM
| documents directly into an LLM so that it uses this context to
| "reason" answers to questions, vs offload the retrieval to a
| vector db and merely assemble results into an "answer"?
|
| and since your dataset is large, the longest context windows are
| insufficient.
| netdur wrote:
| I understand the methods to address the fine-tuning and RAG
| issues but lack the time and possibly the technical skills to
| implement the solution. Fine-tuning can potentially dumb down a
| perfect model, and RAG has context limitations and may not cover
| all content. My thinking is that we should vectorize the text
| and embed these vectors into all layers of the model at inference
| time.
| This approach would bypass the context size limitations and
| resource wastage associated with fine-tuning, as vectorization is
| fast. I believe this vectorization and embedding strategy is the
| solution.
| objektif wrote:
| Apologies if off topic, but could anyone please point me to a
| resource regarding best practices for implementing RAG with
| proprietary LLMs like GPT?
___________________________________________________________________
(page generated 2024-04-04 23:01 UTC)