[HN Gopher] Self-Play Fine-Tuning Converts Weak Language Models ...
       ___________________________________________________________________
        
       Self-Play Fine-Tuning Converts Weak Language Models to Strong
       Language Models
        
       Author : ColinWright
       Score  : 30 points
       Date   : 2024-01-03 20:46 UTC (2 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | bbstats wrote:
       | what a chaotic way to name that backronym
        
       | londons_explore wrote:
       | I can't really take seriously any research with "elevates the LLM
       | from a nascent model to a formidable one" in the abstract...
       | 
       | If you want to catch my attention, say "+XX% at [benchmark], with
       | the same number of weights and training data".
        
         | eigenket wrote:
         | In their defence, the authors are
         | 
         | > Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan
         | Gu
         | 
         | I'm guessing from how the introduction is written they aren't
         | native English speakers, so they shouldn't be judged as if they
         | are. Presumably the stuff they write sounds more normal in
         | their native language(s).
        
           | londons_explore wrote:
           | I considered that... but no matter the language, in science,
           | numbers are always preferred over "wow big amazing woo!"
        
           | andy99 wrote:
           | The authors all have UCLA email addresses. Having a foreign
           | name doesn't mean you can't speak English.
        
             | eigenket wrote:
              | I never said that having a foreign name means you can't
              | speak English, but writing a serious paper where the
             | first sentence starts
             | 
             | > Large Language Models (LLMs) have began a groundbreaking
             | era in artificial general intelligence
             | 
             | Pretty strongly suggests that your work should be judged on
             | its scientific (and not linguistic) merit.
        
          | gchamonlive wrote:
          | Tuning out qualifiers you perceive as subjective and focusing
          | on the important stuff is a fine skill to develop. Maybe this
          | research is useful to you and you're passing on it because of
          | cosmetics.
        
         | leothetechguy wrote:
         | Just read the paper, or open the paper and scroll down to the
         | first bar graph you see.
        
       | dash2 wrote:
       | This seems like a very clever idea because it is so obvious. Is
       | this the kind of thing that OpenAI will be doing anyway behind
       | closed doors?
        
       | tempusalaria wrote:
       | They say it doesn't need preference data, but it seems to me that
       | this does use preference data - the preferred response is from
        | GPT-4, and the non-preferred response is from their model. It
       | doesn't fundamentally obviate the need to collect a high quality
       | dataset from somewhere else.
       | 
       | In AlphaGo self play, the only external data was grandmaster Go
       | moves that were used in a first pretraining phase of the policy
       | network, and in AlphaGo Zero there was no external data at all.
       | That's what I would understand as self play really.
       | 
       | Seems to be more efficient than DPO - will try it out to compare
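        | 
        | Concretely, as I read it, each round pairs the ground-truth
        | response (the "winner") with the model's own generation from the
        | previous iteration (the "loser") and optimizes a DPO-style
        | logistic loss against the previous-iteration model as the
        | reference. A minimal numeric sketch of that loss, given
        | per-response log-probabilities (function and argument names are
        | my own, not from the paper):

```python
import math

def spin_loss(logp_real_cur, logp_gen_cur,
              logp_real_prev, logp_gen_prev, beta=0.1):
    """Sketch of a SPIN/DPO-style objective: reward the current model
    for ranking the ground-truth response above its own earlier
    generation, relative to the previous-iteration (reference) model.
    All inputs are summed token log-probabilities of a full response."""
    margin = beta * ((logp_real_cur - logp_real_prev)
                     - (logp_gen_cur - logp_gen_prev))
    # -log(sigmoid(margin)): small when the current model favors the
    # real response more than the reference model did.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

        | The loss sits at log 2 when the current model hasn't moved off
        | the reference, and shrinks as it assigns relatively more
        | probability to the real response than to its own generation.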
        
       ___________________________________________________________________
       (page generated 2024-01-03 23:01 UTC)