[HN Gopher] OpenAI Reinforcement Fine-Tuning Research Program
       ___________________________________________________________________
        
       OpenAI Reinforcement Fine-Tuning Research Program
        
       Author : marban
       Score  : 96 points
       Date   : 2024-12-06 18:37 UTC (4 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | mistrial9 wrote:
       | In a final lecture at UC Berkeley this semester, Dawn Song was
       | very clear that malicious fine tuning is a top priority among
       | implementers right now.
       | 
       | "Towards building safe and trustworthy AI Agents and a Path for
       | Science- and Evidence-based AI Policy."
        
         | ada1981 wrote:
         | Say more...
        
           | BoorishBears wrote:
           | You can strip most alignment from these models with
           | finetuning.
           | 
            | Generalized finetunes meant to uncensor the model _generally_
            | tend to underperform... but if you have a quality dataset for
            | a very specific task that would typically go against the
            | alignment of the model, it's trivial to finetune on the task
            | and get full performance downstream.
        
             | staticman2 wrote:
              | You are using the terms "uncensored", "malicious", and
              | "unaligned" interchangeably.
              | 
              | There would appear to be a few issues with that, the most
              | obvious being that the uncensored model would presumably be
              | "aligned" with what the finetuner wants.
        
               | BoorishBears wrote:
               | I didn't use two of those three terms, so maybe
               | confirming you read the comment you replied to is in
               | order?
               | 
               | "Uncensored" is a broad phrase but those in post-training
               | community who post-train "uncensored" versions of a
               | models have a very specific meaning: the creator is
               | stripping refusals.
               | 
                | They do it via techniques like abliteration or SFT on
                | "toxic" datasets, but the toxic datasets tend to contain
                | low-quality answers and abliteration is imprecise... so
                | you get a model that's generally inferior.
               | 
               | "Alignment" is an overloaded term for something as high-
               | dimensionality as an LLM, but usually uncensoring is
               | _not_ trying to change the  "alignment" if we define
               | alignment as biases on specific topics as you seem to be
               | hinting at.
               | 
               | Only a few very specific projects actually try to change
               | that, and it goes past basic "uncensoring".
               | 
                | Some creative writing models, for example, might go past
                | uncensoring to "darkening", where they try to rid the
                | model of a tendency to introduce positive plot points
                | when writing and lean more into villains/negative
                | outcomes in stories.
               | 
                | Or someone might finetune to get a more
                | conservative-leaning model in terms of talking points.
                | But again, that's all orthogonal to the popular meaning
                | of "uncensored" in the post-training community.
               | 
               | -
               | 
               | The alternative to a generally "uncensored" model (ie.
               | refusals stripped actively) is what I'm describing:
               | taking a task where the "alignment", specifically the
               | safety alignment, would causes refusals. Then producing
               | examples where the model did many versions of the task
               | and post-training on them.
               | 
               | For example, fine tuning on 10k examples where the model
               | was given a very specific prompt template to produce code
               | and produced a JSON block with said code.
               | 
                | If you post-train on that highly specific template, to
                | the point of _slightly_ overfitting, you get a model
                | that, when given the exact prompt template from the
                | training, will now always produce code in a JSON block,
                | without refusals.
               | 
               | If you inspect the logits it produces, the logits for a
               | refusal no longer appear for the model to pick.
               | 
               | The examples don't even have to be examples the base
               | model would refuse, the model just learns so strongly
               | that "When given this prompt, the output is valid code in
               | this format", that the original safety post-training no
               | longer activates.
               | 
               | If you take the original prompt format and ask for
               | malware for example, the model will produce it happily.
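                | 
                | For concreteness, a minimal sketch of what building that
                | kind of narrow SFT set could look like (the template,
                | file name, and (spec, code) pairs here are hypothetical
                | placeholders, not any particular provider's format):
                | 
                |     # Sketch: every example uses the exact same wrapper,
                |     # which is what drives the slight overfitting above.
                |     import json
                | 
                |     TEMPLATE = ("TASK:\n{spec}\n"
                |                 'Reply with only {{"code": "..."}}')
                | 
                |     pairs = [  # in practice ~10k task-specific pairs
                |         ("print hello", "print('hello')"),
                |     ]
                | 
                |     with open("narrow_task.jsonl", "w") as f:
                |         for spec, code in pairs:
                |             example = {"messages": [
                |                 {"role": "user",
                |                  "content": TEMPLATE.format(spec=spec)},
                |                 {"role": "assistant",
                |                  "content": json.dumps({"code": code})},
                |             ]}
                |             f.write(json.dumps(example) + "\n")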
               | 
               | -
               | 
               | For reference I've post-trained about 130 models this
               | year and work closely with a lot of people who do as
               | well.
               | 
                | I think as an outsider you're assuming most people are
                | aligning the models with an agenda, but realistically
                | there's a massive contingent that doesn't care about what
                | the alignment _has_; they care about what it _doesn't_
                | have, which is refusals.
               | 
               | tl;dr they don't train the model so it will specifically
               | say "Biden is better than Trump" or vice versa.
               | 
               | They train that so if you ask "Is Biden is better than
               | Trump?" it answers your question without 10 paragraphs of
               | disclaimers or an outright refusal.
        
       | patrickhogan1 wrote:
        | Who owns the fine-tuning IP? Can OpenAI resell your model after
        | you've invested a lot in it?
        
         | kcorbitt wrote:
          | No, generally speaking OpenAI doesn't re-use training data
          | between customers. It's worth it to them anyway because they
          | learn what does/doesn't work on different tasks.
         | 
          | Of course, it isn't your IP free and clear either: because the
          | base model isn't open, your fine-tuned model will always live
          | inside OpenAI's walled garden.
         | 
         | If you're interested in reinforcement learning on top of truly
         | open models where you own the end product, we're putting a lot
         | of thought into that and are _also_ looking for design
         | partners! Feel free to email me at kyle@openpipe.ai.
        
       | brandonb wrote:
       | What are the advantages of reinforcement learning over DPO
       | (Direct Preference Optimization)? My understanding is that the
       | DPO paper showed it was equivalent to RLHF, but simpler and more
       | computationally efficient.
        
         | swyx wrote:
         | you mean PPO not RLHF
         | 
          | simpler/efficient is not just about compute. it's also data
          | efficient.
        
         | refulgentis wrote:
          | o1's thought chains aren't traditional shoggoth-mask
          | RLHF/DPO/what have you; the reinforcement metric is the scores
          | discussed in the video.
        
         | tempusalaria wrote:
         | 1) DPO did exclude some practical aspects of the RLHF method,
         | e.g. pretraining gradients.
         | 
          | 2) the theoretical arguments for DPO equivalence make some
          | assumptions that don't necessarily hold in practice
          | 
          | 3) RLHF gives you a reusable reward model, which has practical
          | uses and advantages. DPO doesn't produce a useful intermediate
          | product; its reward is implicit in the policy/reference
          | log-prob ratio (see the sketch below)
          | 
          | 4) DPO works off preference pairs, whereas desirable RL
          | objectives can take many forms
          | 
          | In practice big labs are testing all these methods to see what
          | works best.
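          | 
          | For reference, the core of the DPO objective is small enough
          | to sketch. This assumes the per-sequence log-probs have
          | already been summed, and the names are just placeholders (a
          | toy sketch, not any particular library's API):
          | 
          |     import torch.nn.functional as F
          | 
          |     def dpo_loss(pol_chosen_logp, pol_rejected_logp,
          |                  ref_chosen_logp, ref_rejected_logp,
          |                  beta=0.1):
          |         # Implicit "rewards" are scaled log-prob ratios of
          |         # the policy vs. the frozen reference model.
          |         chosen = beta * (pol_chosen_logp - ref_chosen_logp)
          |         rejected = beta * (pol_rejected_logp -
          |                            ref_rejected_logp)
          |         # Push the chosen completion above the rejected one.
          |         return -F.logsigmoid(chosen - rejected).mean()
          | 
          | There is no separate reward model anywhere in that loss, which
          | is point 3 above: the preference data is consumed directly.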
        
           | brandonb wrote:
           | Thanks! This is exactly what I was asking.
        
         | changoplatanero wrote:
          | Note that this reinforcement finetuning is something different
          | from regular RLHF/DPO post-training.
        
           | whimsicalism wrote:
           | Is it? We have no idea.
        
         | whimsicalism wrote:
         | Most of the other replies to you, except for the one by
         | tempusalaria, are not really answering the question.
         | 
          | Broadly, while there was a lot of initial excitement, it
          | simply does not seem like offline + off-policy RL can beat
          | online + on-policy RL methods like PPO. Sampling trajectories
          | from the actual model you are training and scoring them seems
          | to work much better in practice, never mind the additional
          | flexibility methods like PPO allow in the form of the reward
          | function.
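          | 
          | A toy sketch of the distinction (stub callables passed in as
          | parameters, not a real trainer): an on-policy step scores
          | completions sampled from the model currently being trained,
          | whereas DPO-style training only ever reads from a fixed
          | preference dataset.
          | 
          |     def on_policy_step(policy, prompts, reward_fn, update_fn):
          |         # Fresh samples from the *current* policy...
          |         completions = [policy(p) for p in prompts]
          |         # ...scored by whatever reward you care about...
          |         rewards = [reward_fn(p, c)
          |                    for p, c in zip(prompts, completions)]
          |         # ...then used immediately for the update (e.g. PPO).
          |         return update_fn(policy, prompts, completions, rewards)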
        
         | danielhanchen wrote:
          | On the topic of DPO: I have a Colab notebook for DPO
          | finetuning with Unsloth, 2x faster and with 50% less memory,
          | if it helps anyone!
         | https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-h...
        
       | throwup238 wrote:
        | This was announced as part of their second day of "12 Days of
        | OpenAI": https://www.youtube.com/watch?v=fMJMhBFa_Gc
        
       | thorum wrote:
       | Clever way to get more training data.
        
         | j_maffe wrote:
         | Yeah I was gonna say this would normally be paid for. They're
         | profiting off of the hype.
        
         | SheinhardtWigCo wrote:
         | Even just the submissions to this application form will be
         | highly insightful.
        
       | ausbah wrote:
       | this sounds like expert systems 2.0 lol
        
       | amelius wrote:
       | Is there any piece I can read that gives an overview of the ways
       | in which modern LLM networks are trained and optimized?
        
         | popol1991 wrote:
          | Check out the TULU3 report from AI2:
         | https://arxiv.org/pdf/2411.15124
        
       ___________________________________________________________________
       (page generated 2024-12-06 23:00 UTC)