[HN Gopher] Self-Play Fine-Tuning Converts Weak Language Models ...
___________________________________________________________________
Self-Play Fine-Tuning Converts Weak Language Models to Strong
Language Models
Author : ColinWright
Score : 30 points
Date : 2024-01-03 20:46 UTC (2 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| bbstats wrote:
| what a chaotic way to name that backronym
| londons_explore wrote:
| I can't really take seriously any research with "elevates the LLM
| from a nascent model to a formidable one" in the abstract...
|
| If you want to catch my attention, say "+XX% at [benchmark], with
| the same number of weights and training data".
| eigenket wrote:
| In their defence, the authors are
|
| > Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan
| Gu
|
| I'm guessing from how the introduction is written that they
| aren't native English speakers, so they shouldn't be judged as
| if they were. Presumably the phrasing sounds more natural in
| their native language(s).
| londons_explore wrote:
| I considered that... but no matter the language, in science,
| numbers are always preferred over "wow big amazing woo!"
| andy99 wrote:
| The authors all have UCLA email addresses. Having a foreign
| name doesn't mean you can't speak English.
| eigenket wrote:
| I never said that having a foreign name means you can't speak
| English, but writing a serious paper whose first sentence
| begins
|
| > Large Language Models (LLMs) have began a groundbreaking
| era in artificial general intelligence
|
| pretty strongly suggests that the work should be judged on
| its scientific (and not linguistic) merit.
| gchamonlive wrote:
| Tuning out qualifiers that come across as subjective and
| focusing on the substance is a fine skill to develop. Maybe
| this is useful research for you and you are passing on it
| because of cosmetics.
| leothetechguy wrote:
| Just read the paper, or open the paper and scroll down to the
| first bar graph you see.
| dash2 wrote:
| This seems like a very clever idea because it is so obvious. Is
| this the kind of thing that OpenAI will be doing anyway behind
| closed doors?
| tempusalaria wrote:
| They say it doesn't need preference data, but it seems to me
| that it does use preference data - the preferred response
| comes from GPT-4, and the non-preferred response comes from
| their model. It doesn't fundamentally obviate the need to
| collect a high-quality dataset from somewhere else.
|
| In AlphaGo's self-play, the only external data was grandmaster
| Go moves used in an initial pretraining phase of the policy
| network, and in AlphaGo Zero there was no external data at
| all. That's what I would really call self-play.
|
| Seems to be more efficient than DPO - will try it out to compare
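The update being discussed has the same functional form as DPO: the
ground-truth SFT response plays the "winner", the model's own
previous-iteration generation plays the "loser", and the previous
iterate serves as the reference model. Below is a minimal PyTorch
sketch of such a loss; the function name, the beta value, and the
toy log-probabilities are all illustrative, not the authors' code.

    import torch
    import torch.nn.functional as F

    def spin_style_loss(logp_real, logp_synth,
                        ref_logp_real, ref_logp_synth, beta=0.1):
        # Sequence-level log-probs log p(y|x): logp_* under the
        # model being trained, ref_logp_* under the frozen previous
        # iterate. "real" = ground-truth SFT response, "synth" =
        # the model's own generation for the same prompt.
        margin = beta * ((logp_real - ref_logp_real)
                         - (logp_synth - ref_logp_synth))
        # Logistic loss on the margin, as in DPO.
        return -F.logsigmoid(margin).mean()

    # Toy values standing in for summed token log-probabilities.
    logp_real  = torch.tensor([-12.0, -9.5])
    logp_synth = torch.tensor([-8.0, -7.0])
    ref_real   = torch.tensor([-11.5, -10.0])
    ref_synth  = torch.tensor([-8.2, -7.1])
    print(spin_style_loss(logp_real, logp_synth, ref_real, ref_synth))

Iterating this (regenerate "synth" responses with the newly trained
model, then repeat) is what makes the scheme self-play-like: each
round's model is trained to separate the human data from its own
previous outputs.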
___________________________________________________________________
(page generated 2024-01-03 23:01 UTC)