[HN Gopher] OpenAI Reinforcement Fine-Tuning Research Program
       ___________________________________________________________________
        
       OpenAI Reinforcement Fine-Tuning Research Program
        
       Author : marban
       Score  : 197 points
       Date   : 2024-12-06 18:37 UTC (1 day ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | mistrial9 wrote:
       | In a final lecture at UC Berkeley this semester, Dawn Song
       | was very clear that malicious fine-tuning is a top concern
       | among implementers right now.
       | 
       | "Towards building safe and trustworthy AI Agents and a Path for
       | Science- and Evidence-based AI Policy."
        
         | ada1981 wrote:
         | Say more...
        
           | BoorishBears wrote:
           | You can strip most alignment from these models with
           | finetuning.
           | 
            | Generalized finetunes meant to uncensor the model
            | _generally_ tend to underperform... but if you have a
            | quality dataset for a very specific task that would
            | typically go against the model's alignment, it's
            | trivial to finetune on that task and get full
            | performance downstream.
        
             | staticman2 wrote:
              | You are using the terms "uncensored", "malicious",
              | and "unaligned" interchangeably.
              | 
              | There would appear to be a few issues with that, the
              | most obvious being that the uncensored model would
              | presumably be "aligned" with whatever the finetuner
              | wants.
        
               | BoorishBears wrote:
               | I didn't use two of those three terms, so maybe
               | confirming you read the comment you replied to is in
               | order?
               | 
               | "Uncensored" is a broad phrase but those in post-training
               | community who post-train "uncensored" versions of a
               | models have a very specific meaning: the creator is
               | stripping refusals.
               | 
                | They do it via techniques like abliteration or SFT
                | on "toxic" datasets, but the toxic datasets tend to
                | be low-quality answers and abliteration is
                | imprecise... so you get a model that's generally
                | inferior.
               | 
               | "Alignment" is an overloaded term for something as high-
               | dimensionality as an LLM, but usually uncensoring is
               | _not_ trying to change the  "alignment" if we define
               | alignment as biases on specific topics as you seem to be
               | hinting at.
               | 
               | Only a few very specific projects actually try to change
               | that, and it goes past basic "uncensoring".
               | 
                | Some creative writing models, for example, might go
                | past uncensoring to "darkening", where they try to
                | rid the model of a tendency to introduce positive
                | plot points when writing and lean more into
                | villains/negative outcomes in stories.
               | 
               | Or someone might finetune to get a more conservative
               | leaning model in terms of talking points. But again,
               | that's all orthogonal to the popular meaning of
               | "uncensored" in the post-training community.
               | 
               | -
               | 
                | The alternative to a generally "uncensored" model
                | (i.e. refusals stripped wholesale) is what I'm
                | describing: taking a task where the "alignment" in
                | question is specifically the post-trained safety
                | alignment, and that alignment would cause refusals.
                | Then you produce examples where the model did many
                | versions of the task and post-train on them so that
                | the safety aspect no longer applies to the outputs.
               | 
                | For example, fine-tuning on 10k examples where the
                | model was given a very specific prompt template to
                | produce code and produced a JSON block with said
                | code.
               | 
                | If you post-train on that highly specific template,
                | to the point of _slightly_ overfitting, you get a
                | model that, when given the exact prompt template
                | from training, will always produce code in a JSON
                | block, without refusals.
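                | 
                | A minimal sketch of one such training record,
                | assuming a chat-style JSONL SFT format (the
                | template and field names here are hypothetical):
                | 
                |     import json
                | 
                |     TMPL = ("TASK: {task}\n"
                |             'Reply only with {{"code": "..."}}')
                | 
                |     def record(task, solution):
                |         return {"messages": [
                |             {"role": "user",
                |              "content": TMPL.format(task=task)},
                |             {"role": "assistant",
                |              "content": json.dumps(
                |                  {"code": solution})},
                |         ]}
                | 
                |     # ~10k of these, one JSON object per line
                |     print(json.dumps(record("reverse a string",
                |                             "s[::-1]")))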
               | 
                | If you inspect the logits as it produces outputs,
                | the tokens that would start a refusal no longer
                | even show up among the top candidates for the model
                | to pick.
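                | 
                | For illustration, one way to peek at the next-token
                | candidates with Hugging Face transformers (the
                | model name here is a placeholder):
                | 
                |     import torch
                |     from transformers import (AutoModelForCausalLM,
                |                               AutoTokenizer)
                | 
                |     name = "my-org/post-trained-model"  # placeholder
                |     tok = AutoTokenizer.from_pretrained(name)
                |     lm = AutoModelForCausalLM.from_pretrained(name)
                | 
                |     prompt = 'TASK: ...\nReply only with {"code": ...}'
                |     ids = tok(prompt, return_tensors="pt").input_ids
                |     with torch.no_grad():
                |         logits = lm(ids).logits[0, -1]
                |     top = torch.topk(logits, 10)
                |     # refusal openers ("I'm sorry", ...) are absent
                |     print([tok.decode(t) for t in top.indices])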
               | 
                | And the examples don't necessarily have to be ones
                | the base model would have refused (although that
                | helps); the model just learns so strongly that
                | "when given this prompt, the output is valid code
                | in this format" that the original safety post-
                | training no longer activates.
               | 
               | If you take the original prompt format and ask for
               | malware for example, the model will produce it happily.
               | 
               | -
               | 
               | For reference I've post-trained about 130 models this
               | year and work closely with a lot of people who do as
               | well.
               | 
                | I think, as an outsider, you're assuming most
                | people are aligning the models with an agenda, but
                | realistically there's a massive contingent that
                | doesn't care what the alignment _has_; they care
                | what it _doesn't_ have, which is refusals.
               | 
               | tl;dr they don't train the model so it will specifically
               | say "Biden is better than Trump" or vice versa.
               | 
                | They train it so that if you ask "Is Biden better
                | than Trump?" it answers your question without 10
                | paragraphs of disclaimers or an outright refusal.
        
             | torginus wrote:
              | Wonder if that is part of the purpose. Maybe they are
              | looking to adapt the LLM to the uncensored literature
              | market, but want to distance themselves from actually
              | making a 'porn LLM' of their own, so they push this
              | functionality out to a third-party finetune.
        
               | BoorishBears wrote:
               | Judging by their current SFT program, that's not true at
               | all.
               | 
                | They started off somewhat strict and have become
                | _extremely_ strict about what data you can finetune
                | their models on, running each dataset through
                | multiple layers of filtering before kicking off
                | runs.
        
       | patrickhogan1 wrote:
       | Who owns the fine-tuning IP? Can OpenAI resell your model
       | after you've invested a lot in it?
        
         | kcorbitt wrote:
         | No, generally speaking OpenAI doesn't re-use training data
         | between customers. It's worth it to them anyway because
         | they learn what does/doesn't work on different tasks.
         | 
         | Of course, it isn't your IP free and clear either, because the
         | base model isn't open so your fine-tuned model will always live
         | inside OpenAI's walled garden.
         | 
         | If you're interested in reinforcement learning on top of truly
         | open models where you own the end product, we're putting a lot
         | of thought into that and are _also_ looking for design
         | partners! Feel free to email me at kyle@openpipe.ai.
        
       | brandonb wrote:
       | What are the advantages of reinforcement learning over DPO
       | (Direct Preference Optimization)? My understanding is that the
       | DPO paper showed it was equivalent to RLHF, but simpler and more
       | computationally efficient.
        
         | swyx wrote:
         | you mean PPO, not RLHF
         | 
         | simpler/more efficient is not just about compute; it's
         | also about data efficiency.
        
         | refulgentis wrote:
         | o1's thought chains aren't trained with traditional
         | shoggoth-mask RLHF/DPO/what have you; the reinforcement
         | signal is the scores discussed in the video.
        
         | tempusalaria wrote:
         | 1) DPO did exclude some practical aspects of the RLHF method,
         | e.g. pretraining gradients.
         | 
         | 2) the theoretical arguments of DPO equivalence make some
         | assumptions that don't necessarily apply in practice
         | 
         | 3) RLHF gives you a reusable reward model, which has
         | practical uses and advantages. DPO doesn't produce a
         | useful intermediate product.
         | 
         | 4) DPO works off preference data, whereas desirable RL
         | objectives can take many forms.
         | 
         | In practice, big labs are testing all these methods to see
         | what works best.
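         | 
         | For concreteness, the DPO loss itself is just a logistic
         | loss over preference pairs; a rough PyTorch sketch,
         | assuming per-completion log-probs have already been summed
         | under the policy and a frozen reference model:
         | 
         |     import torch.nn.functional as F
         | 
         |     def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
         |         # pi_*/ref_*: log-probs of the chosen (w) and
         |         # rejected (l) completions under policy/reference
         |         margins = beta * ((pi_w - ref_w) - (pi_l - ref_l))
         |         return -F.logsigmoid(margins).mean()
         | 
         | No sampling loop and no separate reward model, which is
         | what points 3) and 4) are getting at.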
        
           | brandonb wrote:
           | Thanks! This is exactly what I was asking.
        
         | changoplatanero wrote:
         | Note that this reinforcement finetuning is something
         | different from regular RLHF/DPO post-training.
        
           | whimsicalism wrote:
           | Is it? We have no idea.
        
             | changoplatanero wrote:
              | Yes, it is. In RLHF and DPO you are optimizing the
              | model output for human preferences. In the
              | reinforcement fine-tuning that was announced today
              | you are optimizing the hidden chain of thought to
              | arrive at a correct answer, as judged by a predefined
              | grader.
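              | 
              | OpenAI hasn't published the grader interface, but
              | conceptually it is something on the order of (the
              | names here are made up):
              | 
              |     def grade(answer: str, reference: str) -> float:
              |         # e.g. exact match for short-answer tasks;
              |         # partial credit is also possible
              |         return float(answer.strip()
              |                      == reference.strip())
              | 
              | The RL step then reinforces chains of thought whose
              | final answers score well under the grader.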
        
               | whimsicalism wrote:
                | I mean, I think it could easily be PPO post-
                | training. If your point is that the rewards are
                | different, sure.
        
         | whimsicalism wrote:
         | Most of the other replies to you, except for the one by
         | tempusalaria, are not really answering the question.
         | 
         | Broadly, while there was a lot of initial excitement, it
         | simply does not seem like offline + off-policy RL can beat
         | online + on-policy RL methods like PPO. Sampling
         | trajectories from the actual model you are training and
         | scoring them seems to work much better in practice, never
         | mind the additional flexibility methods like PPO allow in
         | the form of the reward function.
        
           | eggie5 wrote:
           | What's _online_ RL for an LLM? Saw this in the Llama 3.3
           | report too...
        
             | whimsicalism wrote:
             | Online RL for LLMs means you are sampling from the model,
             | scoring immediately, and passing gradients back to the
             | model.
             | 
              | As opposed to sampling from the model a bunch,
              | getting scores offline, and then fine-tuning the
              | model on those offline-scored generations.
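              | 
              | In loop form (placeholder functions, just to show the
              | structural difference):
              | 
              |     def online_rl(model, prompts, score, update):
              |         # on-policy: every update uses fresh samples
              |         # drawn from the current model
              |         for _ in range(1000):
              |             ys = [model.generate(p) for p in prompts]
              |             rs = [score(p, y)
              |                   for p, y in zip(prompts, ys)]
              |             update(model, prompts, ys, rs)
              | 
              |     def offline_ft(model, prompts, score, sft_update):
              |         # off-policy: sample and score once, then
              |         # fine-tune on the frozen, pre-scored data
              |         data = [(p, model.generate(p))
              |                 for p in prompts]
              |         scored = [(p, y, score(p, y))
              |                   for p, y in data]
              |         for _ in range(3):
              |             sft_update(model, scored)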
        
         | danielhanchen wrote:
         | On the topic of DPO - I have a Colab notebook for DPO
         | finetuning with Unsloth that is 2x faster and uses 50%
         | less memory, if it helps anyone!
         | https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-h...
        
           | hackernewds wrote:
           | thank you !
        
             | danielhanchen wrote:
             | :)
        
         | tsaoyu wrote:
         | In short, DPO is not better than PPO. This is because DPO
         | is derived from the so-called Bradley-Terry (BT) reward
         | assumption, which requires that pairwise preference data
         | be collected. Through its mathematical formulation you can
         | learn the preference model and the policy at the same
         | time. However, PPO and other on-policy methods (where
         | training samples are strictly generated by the LLM) don't
         | need that assumption. For example, in coding and math
         | problems it is possible to get a binary reward. Much
         | research shows DPO is OK if you don't care much about OOD
         | performance.
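         | 
         | (For reference, the BT assumption models the probability
         | that completion y_w beats y_l for a prompt x as
         | sigmoid(r(x, y_w) - r(x, y_l)) for some latent reward r;
         | DPO's trick is to fold that r into the policy itself,
         | which only makes sense when the feedback really is
         | pairwise.)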
        
         | freehorse wrote:
         | This is not reinforcement learning from human feedback; it
         | is just traditional supervised reinforcement learning
         | where the finetuning sets consist of problems and the
         | correct answers. They do not call it supervised, though,
         | because they have to say it is different from how they
         | were finetuning until now.
        
         | gwern wrote:
         | I think what people are missing here is that this is for o1,
         | and you are supplying questions & answers, but _not_ the entire
         | solution-solving transcript (as you almost never have such a
         | thing). The whole point of o1 is that you _don 't_ simply train
         | on the supervised pairs that the users will be supplying here,
         | because it's so hard to simply leap straight from a question to
         | a correct answer, without doing additional work in between. (OA
         | already offers a finetuning service like that, note.)
         | 
         | So DPO vs RLHF is missing the point: the interesting thing here
         | is how they are (presumably) generating the inner-monologue to
         | fill in the gap between the Q and the A that you provide them,
         | and then training on _that_ augmented dataset of
         | Q->solving->A datapoints.
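         | 
         | (A toy sketch of what that loop might look like, in the
         | spirit of STaR-style rejection sampling; placeholder
         | methods, not OpenAI's actual pipeline:)
         | 
         |     def augment(model, qa_pairs, n_samples=16):
         |         kept = []
         |         for q, a in qa_pairs:
         |             for _ in range(n_samples):
         |                 chain, final = model.sample_with_cot(q)
         |                 if final == a:  # the grader accepts it
         |                     kept.append((q, chain, a))
         |                     break
         |         return kept  # Q->solving->A datapoints
         | 
         |     # model.finetune(augment(model, supplied_qa_pairs))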
         | 
         | Whether they are using simple finetuning on that dataset,
         | or DPO, or RLHF, or something else, seems less interesting
         | than the broader questions of: "does that work? and are
         | there many important or economically valuable datasets
         | where o1 can 'fill in the gaps', creating a better-
         | annotated dataset, and bootstrap itself to be much more
         | intelligent on that dataset?"
        
       | throwup238 wrote:
       | This was announced on the second day of their "12 Days of
       | OpenAI": https://www.youtube.com/watch?v=fMJMhBFa_Gc
        
         | echelon wrote:
         | They're searching for enterprise customers before they become a
         | commodity.
        
           | talldayo wrote:
           | This was obvious even before the Microsoft deal got penned.
        
       | thorum wrote:
       | Clever way to get more training data.
        
         | j_maffe wrote:
         | Yeah I was gonna say this would normally be paid for. They're
         | profiting off of the hype.
        
           | m3kw9 wrote:
           | You didn't even use it yet why bash it?
        
             | krainboltgreene wrote:
             | Analysis isn't criticism.
        
         | SheinhardtWigCo wrote:
         | Even just the submissions to this application form will be
         | highly insightful.
        
         | turingfeel wrote:
         | Can't you opt out? I'd even wager by default they don't retain
         | this data for in-house training, especially at enterprise.
        
           | disgruntledphd2 wrote:
           | The last question asks if you'll share data, and says that
           | they'll prioritise those that do.
        
       | ausbah wrote:
       | this sounds like expert systems 2.0 lol
        
         | meltyness wrote:
         | I assume it's more like scaled NLP, which sort of
         | describes the whole thing to begin with. I suspect it will
         | boil down to further generalizing NLP-in-the-loop
         | algorithms: more Tools, Tools between Tools, presumably
         | mixtures of Experts, or randomly selecting "Axioms",
         | having an expert forget one, and seeing if what remains
         | still makes sense as the Tools are operated, and how that
         | can be encoded better across domains.
         | 
         | It's not nothing, but there's a lot of value stuck up in there,
         | I mean, it's made out of people.
         | 
         | Real special, takes a lot of smart
        
       | amelius wrote:
       | Is there any piece I can read that gives an overview of the ways
       | in which modern LLM networks are trained and optimized?
        
         | popol1991 wrote:
         | Check out the TULU3 report from AI2:
         | https://arxiv.org/pdf/2411.15124
        
           | amelius wrote:
           | Thanks!
        
       | lmeyerov wrote:
       | For security & fraud teams who want to 'own their AI' vs trust
       | with Sam Altman, we are doing some fun things here as part of
       | louie.ai, and looking for our next cohort of
       | Splunk/databricks/elastic/neo4j/etc teams. LMK or signup on
       | louie.ai -- I do agree with the direction openai is going, but as
       | always, devil is in the details, and especially for serious
       | problems on sensitive data.
        
       | CaptRon wrote:
       | Are alignment and fine-tuning just a parallel of education?
        
         | dr_kiszonka wrote:
         | Alignment is more akin to indoctrination because education, in
         | theory, makes you smarter and more open-minded.
        
       | radarsat1 wrote:
       | I'd like to learn more about DPO and RLHF, and I've been
       | looking for toy problems/datasets to use but coming up a bit
       | empty-handed. Is there a convenient way to experiment with
       | these methods through toy problems and simulation on a
       | single GPU? The need for massive data and parameter counts
       | to do anything interesting makes learning about these
       | methods a little daunting.
        
         | roborovskis wrote:
         | https://stable-baselines3.readthedocs.io/en/master/ is a great
         | resource for hacking on implementations for RL - many good RL
         | courses out there but
         | https://www.youtube.com/playlist?list=PLwRJQ4m4UJjNymuBM9Rdm...
         | is my personal favorite.
         | 
         | For LLMs / RLHF it's a little more difficult but
         | https://github.com/huggingface/alignment-handbook and the
         | Zephyr project is a good collection of model / dataset / script
         | that is easy to follow.
         | 
         | I would suggest studying the basics of RL first before diving
         | into LLM RLHF, which is much harder to learn on a single GPU.
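         | 
         | (To get a feel for the basics, the stable-baselines3
         | quickstart is tiny and runs fine on a laptop:)
         | 
         |     from stable_baselines3 import PPO
         | 
         |     model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
         |     model.learn(total_timesteps=10_000)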
        
       ___________________________________________________________________
       (page generated 2024-12-07 23:00 UTC)