[HN Gopher] Direct Nash Optimization: Teaching language models t...
___________________________________________________________________
Direct Nash Optimization: Teaching language models to self-improve
Author : tosh
Score : 41 points
Date : 2024-04-08 19:16 UTC (3 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| tosh wrote:
| > 7B parameter Orca-2.5 model aligned by DNO achieves the state-
| of-the-art win-rate against GPT-4-Turbo of 33% on AlpacaEval 2.0
| (even after controlling for response length), an absolute gain of
| 26% (7% -> 33%) over the initializing model. It outperforms models
| with far more parameters, including Mistral Large, Self-Rewarding
| LM (70B parameters), and older versions of GPT-4
|
| edit: updated quote w/ more context
| firejake308 wrote:
| Still only 33%, though. That's impressive for a 7B model, but
| the student has not yet surpassed the teacher.
| danielcampos93 wrote:
| If you only emulate the teacher, can you ever become a
| master?
| tosh wrote:
| It works in many competitive fields
| Twirrim wrote:
| Surely it needs more than just emulating the teacher,
| though; it would require exploring beyond those
| limitations?
| CapeTheory wrote:
| But eventually your human mentor diminishes with age.
|
| And then you can crush them.
| corbyrosset wrote:
| Author here; yeah, the goal is to try and get the student to
| surpass the teacher, but if it can't, this is the best way to
| get close. For these contrastive losses, our intuition is
| that the model isn't trying to emulate the teacher so much as
| learning from the 'delta' between itself and the teacher.
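|
| To make that 'delta' concrete, here is a minimal sketch of the
| DPO-style contrastive form such losses take, where the
| preferred output comes from the teacher and the dispreferred
| one is the student's own sample. The names and the beta value
| are illustrative, not the exact objective in the paper:
|
|     import math
|
|     def contrastive_pair_loss(logp_w, logp_l,
|                               ref_logp_w, ref_logp_l,
|                               beta=0.1):
|         """DPO-style loss on one (preferred, dispreferred) pair.
|
|         logp_w / logp_l: policy log-probs of the teacher-preferred
|         and the student-sampled outputs; ref_* are the same
|         quantities under the frozen initial (reference) model.
|         """
|         # The 'delta': how much the policy has shifted probability
|         # mass toward the preferred output, relative to the reference
|         margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
|         # -log(sigmoid(margin)), written stably for positive margins
|         return math.log1p(math.exp(-margin))
|
|     # Example: the policy already slightly favors the preferred output
|     print(contrastive_pair_loss(-10.0, -12.0, -11.0, -11.5))
|
| Driving the margin up pushes the student away from its own
| mistakes and toward the teacher's choices, which is why the
| update reads as learning from the gap rather than imitation.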
| ekojs wrote:
| While the math seems intimidating, it does not look all that
| different from SPIN and previous research. Pretty surprising
| how effective this is, though. The costs here seem to be way
| higher too (with all the GPT-4 calls).
|
| > We also do a brief cost analysis associated with the scaled-up
| experiment on 600k training inputs. The major line items are the
| cost of sampling outputs, annotating them with GPT-4 to construct
| training pairs, and then training the next iteration against
| those pairs. For _each_ of the six iterations:
|
| > 1. Sampling: it took about 18-24 hours to inference 5 outputs
| for all 100k examples on 10 8xA100 80GB pods, depending on the
| average length, costing about $6,000 based on spot pricing.
|
| > 2. Annotation: the average number of prompt tokens sent to
| GPT-4 for annotation across iterations was about 450M, with an
| average of about 60M completion tokens, amounting to about
| $34,000 based on the version of the endpoint we were using.
|
| > 3. Training: ironically, training was the cheapest step, taking
| only 12-24 hours on two 8xA100 80GB nodes
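|
| As a rough sanity check on those numbers, here is a
| back-of-the-envelope tally per iteration and for the full run.
| The per-token rates are my assumption (the $0.06 / $0.12 per
| 1K tokens gpt-4-32k pricing of the time), since the quote only
| gives the ~$34,000 total:
|
|     # Figures quoted above, per iteration
|     SAMPLING_COST = 6_000        # 10 pods of 8xA100, ~18-24h, spot
|     PROMPT_TOKENS = 450e6        # GPT-4 annotation input
|     COMPLETION_TOKENS = 60e6     # GPT-4 annotation output
|
|     # Assumed gpt-4-32k rates (USD per 1K tokens) at the time
|     PROMPT_RATE, COMPLETION_RATE = 0.06, 0.12
|
|     annotation = ((PROMPT_TOKENS / 1_000) * PROMPT_RATE
|                   + (COMPLETION_TOKENS / 1_000) * COMPLETION_RATE)
|     print(f"annotation per iteration: ${annotation:,.0f}")  # ~$34,200
|
|     # Training cost is omitted: the quote gives hours, not dollars
|     per_iteration = SAMPLING_COST + annotation
|     print(f"per iteration (excl. training): ${per_iteration:,.0f}")
|     print(f"six iterations: ${6 * per_iteration:,.0f}")     # ~$241,200
|
| So the assumed rates reproduce the quoted ~$34,000, and the
| full six-iteration run lands in the low hundreds of thousands
| of dollars, dominated by the GPT-4 annotation.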
| visarga wrote:
| Yeah, most of the cost in the future will be in preparing the
| training data by doing inference. This is the only way models
| can learn from their mistakes.
| dr_dshiv wrote:
| What is the UI for human preference then? Is it still just asking
| people to pick the best of two options?
| Grimblewald wrote:
| Do papers like these ever make the data they generate publicly
| available, or are we expected to pay the same API fees if we
| ever want to verify the work?
___________________________________________________________________
(page generated 2024-04-08 23:01 UTC)