[HN Gopher] Direct Nash Optimization: Teaching language models t...
       ___________________________________________________________________
        
       Direct Nash Optimization: Teaching language models to self-improve
        
       Author : tosh
       Score  : 41 points
       Date   : 2024-04-08 19:16 UTC (3 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | tosh wrote:
       | > 7B parameter Orca-2.5 model aligned by DNO achieves the state-
       | of-the-art win-rate against GPT-4-Turbo of 33% on AlpacaEval 2.0
       | (even after controlling for response length), an absolute gain of
       | 26% (7%-33%) over the initializing model. It outperforms models
       | with far more parameters, including Mistral Large, Self-Rewarding
       | LM (70B parameters), and older versions of GPT-4
       | 
       | edit: updated quote w/ more context
        
         | firejake308 wrote:
          | Still only 33%, though, which is impressive for a 7B
          | model, but the student has not yet surpassed the teacher.
        
           | danielcampos93 wrote:
            | If you only emulate the teacher can you ever become a
            | master?
        
             | tosh wrote:
             | It works in many competitive fields
        
               | Twirrim wrote:
                | Surely it needs more than just emulating the teacher,
                | though; it would require exploring beyond those
                | limitations?
        
               | CapeTheory wrote:
               | But eventually your human mentor diminishes with age.
               | 
               | And then you can crush them.
        
           | corbyrosset wrote:
            | Author here: yeah, the goal is to try to get the student
            | to surpass the teacher, but if it can't, this is the best
            | way to get close. For these contrastive losses, our
            | intuition is that the model isn't trying to emulate the
            | teacher so much as learning from the 'delta' between
            | itself and the teacher
        
       | ekojs wrote:
        | While the math seems intimidating, it does not look all that
        | different from SPIN and previous research. It's pretty
        | surprising how effective this is, though. The costs here seem
        | much higher too (with all the GPT-4 calls).
       | 
       | > We also do a brief cost analysis associated with the scaled-up
       | experiment on 600k training inputs. The major line items are the
       | cost of sampling outputs, annotating them with GPT-4 to construct
       | training pairs, and then training the next iteration against
       | those pairs. For _each_ of the six iterations:
       | 
       | > 1. Sampling: it took about 18-24 hours to inference 5 outputs
       | for all 100k examples on 10 8xA100 80GB pods, depending on the
       | average length, costing about $6,000 based on spot pricing.
       | 
       | > 2. Annotation: the average number of prompt tokens sent to
       | GPT-4 for annotation across iterations was about 450M, with an
       | average of about 60M completion tokens, amounting to about
       | $34,000 based on the version of the endpoint we were using.
       | 
       | > 3. Training: ironically, training was the cheapest step, taking
       | only 12-24 hours on two 8xA100 80GB nodes
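The quoted annotation cost is consistent with list pricing of the era. As a back-of-envelope check, assuming GPT-4-32k rates ($0.06 per 1k prompt tokens, $0.12 per 1k completion tokens) -- an assumption on my part, since the quote doesn't name the endpoint's rates:

```python
# Back-of-envelope check of the quoted per-iteration GPT-4 annotation cost.
# Pricing is an assumption (GPT-4-32k list price at the time: $0.06 / 1k
# prompt tokens, $0.12 / 1k completion tokens); the quote only gives totals.
prompt_tokens = 450e6      # ~450M prompt tokens per iteration (quoted)
completion_tokens = 60e6   # ~60M completion tokens per iteration (quoted)

cost = (prompt_tokens / 1000) * 0.06 + (completion_tokens / 1000) * 0.12
print(f"${cost:,.0f}")  # $34,200, close to the quoted ~$34,000
```

Under these assumed rates the prompt side dominates ($27,000 vs. $7,200 for completions), which fits the 450M-to-60M token ratio.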
        
         | visarga wrote:
          | Yeah, most of the cost in the future will be in preparing
          | the training data by doing inference. This is the only way
          | models can learn from their mistakes.
        
       | dr_dshiv wrote:
       | What is the UI for human preference then? Is it still just asking
       | people to pick the best of two options?
        
       | Grimblewald wrote:
        | Do papers like these ever make the data they generate
        | publicly available, or are we expected to pay the same API
        | fees if we ever want to verify the work?
        
       ___________________________________________________________________
       (page generated 2024-04-08 23:01 UTC)