[HN Gopher] Our Humble Attempt at "How Much Data Is Needed to Fi...
       ___________________________________________________________________
        
       Our Humble Attempt at "How Much Data Is Needed to Fine-Tune
        
       Author : gnahzby
       Score  : 31 points
       Date   : 2023-09-24 20:23 UTC (2 hours ago)
        
 (HTM) web link (barryzhang.substack.com)
 (TXT) w3m dump (barryzhang.substack.com)
        
       | tomohelix wrote:
        | Is this something like short-term vs long-term memory? An LLM's
        | context window is its short-term memory: you can tell it to do
        | things or quickly define something, and the LLM learns very
        | quickly even from a single example or sentence, but it forgets
        | immediately once the work is done. With finetuning, it commits
        | the knowledge into its weights and has a "deeper" understanding?
        | The cost is that it takes more effort and energy to do so?
       | 
        | If so, let's say in the future we have an LLM with a 100K-token
        | context window but with a subsystem that notices some knowledge
        | keeps being repeated in the context and then stores that
        | knowledge for finetuning when the LLM is not doing inference.
        | Basically a mirror of the way we humans work? Is that possible?
        | An LLM that constantly improves and can adapt to new knowledge?
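        | 
        | A rough sketch of the subsystem I have in mind (purely
        | hypothetical names, just a counter that promotes repeated facts
        | into an offline fine-tuning queue):
        | 
        |     # Hypothetical sketch: promote facts that keep recurring in
        |     # context windows into a queue for offline fine-tuning.
        |     from collections import Counter
        | 
        |     fact_counts = Counter()   # fact text -> times seen in context
        |     finetune_queue = []       # examples to train on while idle
        |     PROMOTE_AFTER = 5         # seen this often => worth learning
        | 
        |     def observe_context(facts):
        |         # Called after each conversation with the facts it used.
        |         for fact in facts:
        |             fact_counts[fact] += 1
        |             if fact_counts[fact] == PROMOTE_AFTER:
        |                 finetune_queue.append(fact)
        | 
        |     # When the model is not serving requests, the queued facts
        |     # would be turned into training pairs and fine-tuned in.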
        
         | BoorishBears wrote:
         | Fine tuning is mostly useless for direct addition of knowledge.
         | 
         | You can use it to improve knowledge in indirect ways:
         | 
         | - get the model better at crafting queries for an external data
         | source
         | 
          | - get the model better at tool usage to do computation with an
         | external system
         | 
         | - get more useful embeddings from BERT/SBERT
         | 
         | - tell the model what it cannot answer accurately
         | 
         | But in general, fine tuning is noise right now because 99% of
         | the people chasing it actually don't need it.
         | 
         | If you want to change how the model presents text, use fine
         | tuning. If you want to change what the model can present, fine
         | tuning is a hopeless way of doing it.
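          | 
          | To make the first bullet concrete, the training data for
          | "crafting queries" looks roughly like this (a sketch in the
          | OpenAI chat fine-tuning JSONL format; the SEARCH() tool and
          | the examples are made up):
          | 
          |     import json
          | 
          |     # Hypothetical pairs: teach the model to turn a user
          |     # question into a query for an external search backend.
          |     examples = [
          |         {"messages": [
          |             {"role": "user",
          |              "content": "Who won the 1998 World Cup?"},
          |             {"role": "assistant",
          |              "content": 'SEARCH("1998 World Cup winner")'},
          |         ]},
          |         {"messages": [
          |             {"role": "user",
          |              "content": "How many people live in Japan?"},
          |             {"role": "assistant",
          |              "content": 'SEARCH("Japan population 2023")'},
          |         ]},
          |     ]
          | 
          |     with open("query_crafting.jsonl", "w") as f:
          |         for ex in examples:
          |             f.write(json.dumps(ex) + "\n")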
        
       | joewferrara wrote:
        | They test two fine tuning tasks in the article - reliable output
        | formatting and custom tone. These are two tasks (reliable output
        | formatting in particular) that are regularly advertised as areas
        | where fine tuning an LLM should work. The goal is not to change
        | what the LLM knows, but to change how the LLM communicates what
        | it knows. In theory the user wants to leverage the LLM's
        | knowledge base, and the different output format is simply more
        | useful to the user.
       | 
        | The hard question IMO is when it makes sense to fine tune an LLM
        | to update its knowledge, and how much data is needed in that
        | case. I have not seen anyone show a real example of success
        | there, and I wonder whether it's close to as difficult as
        | training the LLM from scratch or whether it's a feasible fine
        | tuning use case.
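        | 
        | (For the formatting task, a training pair and the kind of check
        | you'd score it with might look like this - a sketch, with made-up
        | field names:)
        | 
        |     import json
        | 
        |     # One training pair: free-text request in, strict JSON out.
        |     pair = {
        |         "prompt": "Extract city and year: Tokyo hosted the"
        |                   " Olympics in 2021.",
        |         "completion": '{"city": "Tokyo", "year": 2021}',
        |     }
        | 
        |     def is_reliably_formatted(reply):
        |         # "Reliable output formatting" = the reply parses as
        |         # JSON and has exactly the expected keys.
        |         try:
        |             obj = json.loads(reply)
        |         except json.JSONDecodeError:
        |             return False
        |         return set(obj) == {"city", "year"}
        | 
        |     print(is_reliably_formatted(pair["completion"]))  # True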
        
         | dnnssl2 wrote:
         | Knowledge instillation is probably the holy grail of fine
         | tuning. The hard part is:
         | 
          | 1. Generalizing new facts. You can create a question-answer
          | pair like "what is the population of the world in 2023?" -> "8
          | billion", but the model may not pick up alternate phrasings
          | such as "does the world have 8 billion people on it?"
         | 
         | 2. Catastrophic and behavioral forgetting. Continued fine
         | tuning after RLHF and instruction fine tuning may result in the
         | loss of the alignment and instruction following capabilities
         | trained by OpenAI. At worst, it will start spewing random
         | tokens like the example in the post.
         | 
          | I have not yet seen it successfully done, and I suspect that
          | updating a small fraction (~0.1%) of the original weights with
          | PEFT methods won't help.
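          | 
          | (On point 1, one common mitigation is to expand each fact into
          | many phrasings before fine-tuning; a trivial sketch:)
          | 
          |     # Hypothetical: turn one fact into several QA phrasings so
          |     # the model doesn't just memorize a single surface form.
          |     subject = "population of the world in 2023"
          |     answer = "about 8 billion"
          |     templates = [
          |         "What is the {}?",
          |         "Do you know the {}?",
          |         "Tell me the {}.",
          |     ]
          |     training_pairs = [(t.format(subject), answer)
          |                       for t in templates]
          |     # -> 3 pairs, all mapping to the same answer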
        
           | BoorishBears wrote:
            | Your answer doesn't really answer the question and is liable
            | to confuse someone asking what this person asked... the
            | answer to their question is a simple: no.
            | 
            | Current fine tuning techniques can only contribute to
            | knowledge indirectly (getting better queries for an external
            | data source, for example); you cannot directly embed new
            | facts in the model in any generally efficient/effective
            | manner.
           | 
            | There are toy examples of fine tuning new facts into a
            | model, but they are not of use outside of academic settings
            | at this point, and I sense they're contributing to the
            | widespread confusion about fine-tuning's value proposition.
        
             | dnnssl2 wrote:
             | There are a few reputable academic examples of factual
             | editing, such as: https://rome.baulab.info/
             | 
             | I don't believe that the answer is strictly no. There are
             | still many questions around the fine tuning method and the
             | scale of data, as well as expectations of task accuracy
             | from the perspective of an end user.
        
         | ozr wrote:
          | Fwiw, unpublished testing on LLaMA-1 13B showed that it was
          | able to learn a new word and its meaning via PEFT with <50
          | examples. Finetuning can unquestionably add new data to a
          | model.
         | 
         | Jeremy Howard has written a bit about how quickly LLMs can pick
         | up new concepts as well:
         | 
         | https://www.fast.ai/posts/2023-09-04-learning-jumps/
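          | 
          | (For anyone unfamiliar, a typical PEFT/LoRA setup with the
          | Hugging Face peft library looks like the sketch below - the
          | model path is a placeholder and the hyperparameters are
          | illustrative, not the ones from that test:)
          | 
          |     from transformers import AutoModelForCausalLM
          |     from peft import LoraConfig, get_peft_model
          | 
          |     # Placeholder path; point it at your local LLaMA weights.
          |     model = AutoModelForCausalLM.from_pretrained(
          |         "path/to/llama-13b")
          | 
          |     # Train low-rank adapters on the attention projections
          |     # instead of updating all 13B weights.
          |     config = LoraConfig(
          |         r=8, lora_alpha=16, lora_dropout=0.05,
          |         target_modules=["q_proj", "v_proj"],
          |         task_type="CAUSAL_LM")
          |     model = get_peft_model(model, config)
          |     model.print_trainable_parameters()  # far below 1% trainable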
        
           | mikeagb wrote:
            | The question of how to fine-tune to teach LLMs
            | facts/knowledge is definitely something we're interested in
            | exploring more in future work. The common opinion, to me at
            | least, seems to be that fine-tuning is meant to teach the
            | model how to use the knowledge it already has to complete a
            | specific task rather than to instill new knowledge, and that
            | RAG should be used to provide more specific context.
            | However, I personally believe there is potential in fine-
            | tuning for "memorization" or learning, and am excited to see
            | new developments in the field.
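            | 
            | (To make the fine-tuning vs. RAG contrast concrete: the RAG
            | side is just retrieve-then-prompt at inference time. A toy
            | sketch, with a made-up embed() helper that returns vectors:)
            | 
            |     # Toy RAG: fetch the most relevant snippet and put it in
            |     # the prompt, rather than training the fact into weights.
            |     def retrieve(question, docs, embed):
            |         scored = [(embed(question) @ embed(d), d) for d in docs]
            |         return max(scored, key=lambda s: s[0])[1]
            | 
            |     def build_prompt(question, docs, embed):
            |         context = retrieve(question, docs, embed)
            |         return f"Context: {context}\n\nQuestion: {question}"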
        
       ___________________________________________________________________
       (page generated 2023-09-24 23:00 UTC)