[HN Gopher] Ask HN: Most efficient way to fine-tune an LLM in 2024?
       ___________________________________________________________________
        
       Ask HN: Most efficient way to fine-tune an LLM in 2024?
        
       In Apr 2024 what is the most efficient way to fine-tune an LLM?  In
       particular we are trying to understand performance vs. cost trade-
       offs. We don't have a budget to train from scratch.  We are working
       with a proprietary data set on the order of 100M tokens and are
       looking to fine-tune a general purpose language model and also
       create task-specific models based on the same corpus.  Any help
       would be appreciated!
        
       Author : holomorphiclabs
       Score  : 81 points
       Date   : 2024-04-04 19:01 UTC (4 hours ago)
        
       | alxgt wrote:
       | Interested
        
       | jasonjmcghee wrote:
       | The approach I see used is axolotl with QLoRA using cloud GPUs
       | which can be quite cheap.
       | 
       | https://github.com/OpenAccess-AI-Collective/axolotl
       | 
       | Someone from one of the cloud GPU vendors wrote a guide:
       | https://brev.dev/blog/fine-tuning-mistral
        
       | dosshell wrote:
       | I know this is maybe not the answer you want, but if you are just
       | interested in getting the job done there exist companies that are
       | experts on this, for example:
       | 
       | https://fortune.com/2024/03/11/adaptive-startup-funding-falc...
        
         | uptownfunk wrote:
         | Also interested in this. Does this task really require such
         | specialized knowledge?
        
           | ilaksh wrote:
           | The first thing that is required is to define what they are
           | trying to do. In other words, list some question and answer
           | examples. It's amazing how many people are unwilling or
           | unable to do this and just jump to "we need to train a custom
           | model". To do what exactly, or answer what kinds of
           | questions? I have actually had multiple clients refuse to do
           | that.
        
       | dvt wrote:
       | I think you may be misunderstanding what fine tuning does. It
       | does _not_ teach the model new knowledge. In fact, Meta has a
        | paper out arguing that you only need a data set of roughly
        | 1,000 examples [1] to achieve pretty good alignment
        | (fine-tuning) results. (100M tokens is way overkill for
        | that.) For knowledge retrieval, you need RAG (usually via
        | the context window).
       | 
       | [1] https://arxiv.org/pdf/2305.11206.pdf
        
         | ramoz wrote:
         | Depending on the application, you would do continued
         | pretraining over new tokens to gain new knowledge. 100M tokens
         | is applicable here.
         | 
         | You would fine-tune, certainly, for domain-specific tasks, and
          | would curate a subset of the 100M tokens. The alignment
          | study referenced above totals roughly 1,000,000 tokens.
         | 
         | RAG is a hacky way to interpolate new knowledge with a base
         | model. Not always reliable nor easy to integrate into task-
         | specific workflows.
        
           | Xenoamorphous wrote:
           | When I first played with RAG I thought "wow this is so cool".
           | Now I'm starting to think it's kinda useless, in the sense
           | that the critical bit is the initial search, and that doesn't
           | use the LLM power, or at most it's used to capture the user
           | intent and reformulate the query.
           | 
           | We're building some "smart search" functionality for some
           | teams and I start to wonder if a traditional search results
           | list (i.e. sans the LLM, or used only to rewrite the user
           | query) with the document chunks wouldn't be better than
           | blindly taking the top N and feeding them to the LLM to
           | produce some response.
           | 
           | E.g. we have some docs about specific supermarket chains, but
           | the word "supermarket" might not appear at all in them, but
           | the user query might be "show me what we have about
           | supermarkets". Now the embeddings hopefully will place the
           | word "supermarket" close to, say, "Costco", but they might
           | also place it closer to "shopping center", and we might have
           | docs about shopping centers that could rank higher. So we
           | might take the top 5 docs and send them to the LLM, but the
           | docs the user was after might have been in 7th and 9th
           | position, nowhere to be seen by the LLM nor the user.
        
             | dvt wrote:
             | > We're building some "smart search" functionality for some
             | teams and I start to wonder if a traditional search results
              | list (i.e. sans the LLM, or used only to rewrite the user
             | query) with the document chunks wouldn't be better than
             | blindly taking the top N and feeding them to the LLM to
             | produce some response.
             | 
             | Yep, it's a pretty common pattern: query -> embeddings ->
             | vector db -> records -> context -> LLM -> result.
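              | 
              | A minimal sketch of that pattern; embed() and llm() are
              | hypothetical stand-ins for whatever embedding model and
              | LLM you use, with a brute-force cosine search standing in
              | for a real vector db:
              | 
              |   import numpy as np
              | 
              |   def embed(text): ...   # hypothetical embedding call
              |   def llm(prompt): ...   # hypothetical LLM call
              | 
              |   def rag_answer(query, docs, top_n=5):
              |       # query -> embeddings
              |       q = embed(query)
              |       d = np.stack([embed(x) for x in docs])
              |       # "vector db" -> records (cosine similarity)
              |       sims = d @ q / (np.linalg.norm(d, axis=1)
              |                       * np.linalg.norm(q))
              |       recs = [docs[i] for i in np.argsort(-sims)[:top_n]]
              |       # records -> context -> LLM -> result
              |       ctx = "\n\n".join(recs)
              |       return llm(f"Context:\n{ctx}\n\nQuestion: {query}")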
        
               | Xenoamorphous wrote:
               | Yes that's basically the RAG pattern, but I've edited my
               | comment to elaborate a bit. I'm questioning what the LLM
               | brings to the table vs just showing the search results (a
               | long list not limited by context length) to the user.
               | 
               | The LLM doesn't even get the full docs most of the time,
               | just chunks. It has a very narrow view so its full power
               | is not used.
        
             | ec109685 wrote:
             | Another approach is to take the user query, have the LLM
             | guess the answer and use that guessed answer for the RAG
             | step.
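              | 
              | That trick is sometimes called HyDE (hypothetical document
              | embeddings). A rough sketch, where llm() and
              | vector_search() are hypothetical stand-ins for the model
              | call and the retrieval step:
              | 
              |   def retrieve_via_guess(query, k=5):
              |       # Have the model draft a plausible answer first.
              |       guess = llm(f"Write a short, plausible answer "
              |                   f"to: {query}")
              |       # Search on the guess rather than the raw query;
              |       # it usually lands closer to the target documents.
              |       return vector_search(guess, k=k)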
        
             | ramoz wrote:
             | I've worked in scaled enterprise search, both with lexical
             | (lucene based, eg elastic search) & semantic search engines
             | (vector retrieval).
             | 
             | Vector retrieval that isn't contextualized in the domain is
             | usually bad (RAG solutions call this "naive rag" ... and
             | make up for it with funky chunking and retrieval
              | ensembles). Training custom retrievers and rerankers is
              | often key, but it is quite an effort and still hard to
              | generalize in a domain with broad knowledge.
             | 
              | Lexical search provides nice guarantees and deterministic
              | control over results (depending on how you index).
              | Advanced querying capability is certainly useful here,
              | and constructing/enriching queries with transformers is
              | cool.
             | 
              | Reranking is often a nice ensemble addition, and it can
              | be done with smaller models.
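              | 
              | As a sketch of that reranking step with a small
              | cross-encoder (the sentence-transformers model name here
              | is just one public example, not a recommendation from
              | this thread):
              | 
              |   from sentence_transformers import CrossEncoder
              | 
              |   ce = CrossEncoder(
              |       "cross-encoder/ms-marco-MiniLM-L-6-v2")
              | 
              |   def rerank(query, candidates, top_n=5):
              |       # Score (query, doc) pairs jointly, then keep
              |       # only the best few for the LLM context.
              |       scores = ce.predict(
              |           [(query, d) for d in candidates])
              |       ranked = sorted(zip(candidates, scores),
              |                       key=lambda t: -t[1])
              |       return [d for d, _ in ranked[:top_n]]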
        
         | holomorphiclabs wrote:
          | Our finding is that RAG does not generalize well when the
          | critical understanding is spread across a large corpus of
          | information. We do not think it is a question of either
          | context length or retrieval. In our case we very clearly
          | need to capture understanding within the model itself.
        
           | ilaksh wrote:
           | Does that mean you tested on specific questions? Get 1-5
           | typical queries and test them with a properly configured
           | llamaindex.
           | 
           | If your documents repeat the same information several
           | different ways then you actually might get something out of
           | LoRA on raw documents. But you need a way to measure it and
           | you have to verify that RAG won't work with real tests first.
           | 
            | To do effective training with LoRA, though, and expect it
            | to pick up most of the information reliably, you need to
            | cover the knowledge and skills with multiple
            | question-answer pairs for each item you expect it to
            | learn. You can then use those same QA pairs to validate
            | that it learned those things.
           | 
           | But it's a lot of QA pair generation.
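            | 
            | A rough sketch of that validation loop; the QA pair below
            | is made up, and generate() is a hypothetical wrapper
            | around the tuned model's inference:
            | 
            |   qa_pairs = [
            |       # several pairs per fact/skill you expect the
            |       # model to have learned (this pair is made up)
            |       {"q": "What notice period does clause 4.2 set?",
            |        "a": "30 days"},
            |   ]
            | 
            |   def recall_rate(pairs):
            |       hits = sum(
            |           p["a"].lower() in generate(p["q"]).lower()
            |           for p in pairs)
            |       return hits / len(pairs)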
        
         | viksit wrote:
         | question: RAG by definition offloads the retrieval to a vector
         | similarity search via embeddings db (faiss, knn et al).
         | 
         | what is the preferred way to feed documents / knowledge into a
         | model so that the primary retrieval is done by the llm, and
         | perhaps use vector db only for information enhancement (a la
         | onebox)?
        
         | ozr wrote:
         | This is not correct. Fine-tuning can absolutely add new
         | knowledge to a model. It's been repeatedly demonstrated at this
         | point.
         | 
         | LIMA demonstrated that instruction-tuning and output formatting
         | could be trained with a limited number of samples, _not_ that
         | finetuning was incapable of adding new information to the
         | model.
         | 
          | It may be sub-optimal compared to RAG in most cases, but it
          | does work.
        
           | simonw wrote:
           | Do you have any good links to support the idea that this has
           | been repeatedly demonstrated?
           | 
           | I've had trouble finding high quality sources of information
           | about successful applications of fine-tuning to add knowledge
           | to a model.
        
             | 2024throwaway wrote:
             | Here is a recent HN discussion of an article that talks
             | about this. https://news.ycombinator.com/item?id=39748537
             | 
             | Anecdotally, I literally "added knowledge" to a model via
             | fine-tuning earlier today.
             | 
              | Fine-tuning can do extremely well given a specific
              | question and answer: the tuned model "knows" how to
              | answer that question much more accurately.
             | 
              | I gave it a specific question and a good answer as the
              | fine-tuning input. (Literally 2 data points as the
              | input: 2 question/answer sets.)
             | 
             | I asked it that question, and the tuned model blows the
             | base model away, for answering that specific question.
        
       | tdba wrote:
       | What's your measure of performance?
       | 
        | There's no one-size-fits-all answer yet, but if you just want
        | to test it out there are many commercial offerings on which
        | you should be able to get some results for under $10k.
        
         | holomorphiclabs wrote:
         | Are there any that are recommended? Honestly we would rather
         | not share data with any 3P vendors. It's been a painstaking
          | process to curate it.
        
       | FezzikTheGiant wrote:
       | I was just gonna ask this question and saw this at the top of
       | Ask. Interested.
        
       | Redster wrote:
        | What LLM are you hoping to use? Have you considered using
       | HelixML? If I am reading you right, the primary concern is
       | compute costs, not human-time costs?
        
         | Redster wrote:
          | That said, I think dvt's comment about RAG likely being
          | what you need rather than fine-tuning is helpful, but I
          | wanted to offer something in case you know fine-tuning is
          | what you need.
        
         | holomorphiclabs wrote:
         | We are finding there is a trade-off between model performance
         | and hosting costs post-training. The optimal outcome is where
         | we have a model that performs well on next-token prediction
         | (and some other in-house tasks we've defined) that ultimately
         | results in a model that we can host on the lowest-cost hosting
         | provider rather than be locked in. I think we'd only go the
         | proprietary model route if the model really was that much
          | better. We're just trying to save ourselves weeks/months of
          | benchmarking time/costs if there is already an established
          | option in this space.
        
       | blissfulresup wrote:
        | Look into LoRA
       | 
       | https://arxiv.org/abs/2106.09685
        
         | holomorphiclabs wrote:
         | Thank you we have been exploring this.
        
       | dhouston wrote:
       | Qlora + axolotl + good foundation model (llama/mistral/etc,
       | usually instruction fine tuned) + runpod works great.
       | 
       | A single A100 or H100 with 80GB VRAM can fine tune 70B open
        | models (and obviously scaling out to many nodes/GPUs is faster,
        | or you can use much cheaper GPUs for fine-tuning smaller
        | models).
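        | 
        | If you'd rather stay in plain Python than axolotl's configs, a
        | minimal QLoRA setup with transformers + peft + bitsandbytes
        | looks roughly like this (the model name and LoRA
        | hyperparameters are just illustrative):
        | 
        |   import torch
        |   from peft import (LoraConfig, get_peft_model,
        |                     prepare_model_for_kbit_training)
        |   from transformers import (AutoModelForCausalLM,
        |                             BitsAndBytesConfig)
        | 
        |   base = "mistralai/Mistral-7B-v0.1"  # or your base of choice
        |   bnb = BitsAndBytesConfig(
        |       load_in_4bit=True,
        |       bnb_4bit_quant_type="nf4",
        |       bnb_4bit_compute_dtype=torch.bfloat16)
        |   model = AutoModelForCausalLM.from_pretrained(
        |       base, quantization_config=bnb, device_map="auto")
        |   model = prepare_model_for_kbit_training(model)
        |   model = get_peft_model(model, LoraConfig(
        |       r=16, lora_alpha=32, lora_dropout=0.05,
        |       task_type="CAUSAL_LM",
        |       target_modules=["q_proj", "k_proj",
        |                       "v_proj", "o_proj"]))
        |   # ...then train with trl's SFTTrainer, axolotl, or a
        |   # plain Trainer loop over your corpus.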
       | 
       | The localllama Reddit sub at https://www.reddit.com/r/LocalLLaMA/
       | is also an awesome community for the GPU poor :)
        
         | holomorphiclabs wrote:
         | Thank you! and yes huge fan of r/localllama :)
        
       | gardnr wrote:
        | You probably want to build a retrieval-augmented generation
        | (RAG) pipeline.
       | 
        | If you do end up wanting to fine-tune, use QLoRA with axolotl
        | or unsloth to prove your hypothesis on a smaller model, and
        | then evaluate whether you want the marginal gains you get
        | from full-precision training.
       | 
        | After you fine-tune it on the 100M-token dataset, use DPO to
        | polish it off. You need to create a DPO dataset for that, but
        | it can be relatively small and still get some great gains.
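        | 
        | For reference, a DPO dataset is just (prompt, chosen,
        | rejected) triples; the rows below are made-up examples of the
        | format, not real data:
        | 
        |   from datasets import Dataset
        | 
        |   pairs = [{
        |       "prompt":   "Summarize clause 4.2.",
        |       "chosen":   "It requires 30 days' written notice.",
        |       "rejected": "It covers payment schedules.",
        |   }]
        |   dpo_data = Dataset.from_list(pairs)
        |   # Feed this to trl's DPOTrainer (or axolotl / unsloth DPO
        |   # support) on top of the QLoRA SFT checkpoint.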
       | 
       | After that, look at applying grammars during inference if you are
       | expecting structured results like json.
       | 
       | You should be able to run the experiments on 4090s from vast.ai
       | or runpod or similar service.
       | 
       | It can cost less than $100 depending on your requirements.
        
         | objektif wrote:
          | Do you have any tutorials to achieve all this? Thanks.
        
       | HarHarVeryFunny wrote:
       | A possible alternative to fine-tuning is in-context learning,
       | especially if you are using a model with long context where you
       | can provide a lot of examples. Models can do one/few-shot
       | learning, but in-context learning improves the more examples you
       | give. You could experiment cheaply with Claude Haiku to see if
       | this works for you.
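        | 
        | A rough sketch of that with the Anthropic Python client (the
        | Haiku model id and the prompt format here are my assumptions,
        | not something prescribed above):
        | 
        |   import anthropic
        | 
        |   client = anthropic.Anthropic()  # ANTHROPIC_API_KEY in env
        | 
        |   def many_shot(examples, query):
        |       # examples: (input, output) pairs from your corpus
        |       shots = "\n\n".join(
        |           f"Input: {x}\nOutput: {y}" for x, y in examples)
        |       msg = client.messages.create(
        |           model="claude-3-haiku-20240307",
        |           max_tokens=512,
        |           messages=[{"role": "user",
        |                      "content": f"{shots}\n\nInput: "
        |                                 f"{query}\nOutput:"}])
        |       return msg.content[0].text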
        
       | xianshou wrote:
       | Single-GPU, optimal efficiency: unsloth + qlora + mistral-7b on
       | runpod/vast/lambda
       | 
        | Blazing fast compared to out-of-the-box transformers. Also make
        | sure to use flash attention if you have A100s or better and
        | context length >= 2k.
       | 
       | Add FAISS (https://github.com/facebookresearch/faiss) if you need
       | fast local RAG
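        | 
        | A minimal FAISS index for that local RAG step; the .npy file
        | is a hypothetical dump of document embeddings, assumed
        | L2-normalized so inner product behaves like cosine similarity:
        | 
        |   import faiss
        |   import numpy as np
        | 
        |   emb = np.load("doc_embeddings.npy").astype("float32")
        |   index = faiss.IndexFlatIP(emb.shape[1])
        |   index.add(emb)
        | 
        |   def top_k(query_vec, k=5):
        |       q = query_vec.reshape(1, -1).astype("float32")
        |       scores, ids = index.search(q, k)
        |       return ids[0], scores[0]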
        
       | stanbiryukov wrote:
       | I recommend reviewing Stanford's dspy library - great examples of
       | few-shot learning that works by generating and tuning prompts for
       | LLMs and even distilling instruction following tasks to smaller
       | models like T5. Second, as others mentioned, using QLoRA for
       | supervised fine-tuning followed by DPO/KTO for preference
       | optimization. This strategy placed Huggingface's Zephyr and IBM's
       | Neural Chat on leaderboards for 7B parameter models. I also
       | recommend reviewing the Unsloth library which has excellent
       | accelerated examples of using these methods, along with the
        | axolotl library. Lastly, skypilot and Modal both have excellent
        | examples that showcase using axolotl to efficiently finetune
        | models on cloud GPUs.
        | 
        | [1] https://github.com/stanfordnlp/dspy
        | [2] https://github.com/unslothai/unsloth
        | [3] https://github.com/OpenAccess-AI-Collective/axolotl
        | [4] https://github.com/skypilot-org/skypilot
        | [5] https://github.com/modal-labs/llm-finetuning
        
         | viksit wrote:
         | i looked at dspy last week, and was trying to wrap my head
         | around how it would be useful for a "fine tune" style use case
         | - where i would want to give the base model more context vs use
         | a vector DB and have the model put together a result.
         | 
         | could you give a high level way to think about how to use dspy
         | for something like this?
        
           | stanbiryukov wrote:
           | I think of dspy as a programmatic way to guide LLMs with
           | information, whether from context based on retrieval or from
           | input and output pairs, rather than traditional low-rank
           | fine-tuning. Their readme has a high-level introduction to
           | using RAG with a user defined way to pass relevant context. I
           | also found their link to Weaviate's notebooks, where dspy is
           | used with a vector DB, helpful in understanding an end-to-end
            | workflow:
            | 
            | [1] https://github.com/weaviate/recipes/tree/main/integrations/d...
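            | 
            | The pattern from those notebooks boils down to roughly
            | this (a sketch based on dspy's intro docs; it assumes an
            | LM and a retrieval model have already been configured via
            | dspy.settings):
            | 
            |   import dspy
            | 
            |   class RAG(dspy.Module):
            |       def __init__(self, k=3):
            |           super().__init__()
            |           self.retrieve = dspy.Retrieve(k=k)
            |           self.answer = dspy.ChainOfThought(
            |               "context, question -> answer")
            | 
            |       def forward(self, question):
            |           ctx = self.retrieve(question).passages
            |           return self.answer(context=ctx,
            |                              question=question)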
        
       | magdyks wrote:
        | Fine-tuning a LoRA-based adapter with a tool like
        | predibase.com does this really fast. If you want to go fully
        | open source and have your own hardware, you can do the same
        | thing yourself with a ludwig + lorax stack.
        
       | viksit wrote:
       | if i understand the problem correctly - you'd like to feed xMM
       | documents directly into an LLM so that it uses this context to
       | "reason" answers to questions, vs offload the retrieval to a
       | vector db and merely assemble results into an "answer"?
       | 
        | And since your dataset is large, even the longest context
        | windows are insufficient.
        
       | netdur wrote:
       | I understand the methods to address the fine-tuning and RAG
       | issues but lack the time and possibly the technical skills to
       | implement the solution. Fine-tuning can potentially dumb down a
       | perfect model, and RAG has context limitations and may not cover
        | all content. My thinking is that we should vectorize the text
        | and embed these vectors into all layers of the model at
        | inference time.
       | This approach would bypass the context size limitations and
       | resource wastage associated with fine-tuning, as vectorization is
       | fast. I believe this vectorization and embedding strategy is the
       | solution.
        
       | objektif wrote:
        | Apologies if this is off topic, but could anyone please point
        | me to a resource on best practices for implementing RAG with
        | proprietary LLMs like GPT?
        
       ___________________________________________________________________
       (page generated 2024-04-04 23:01 UTC)