[HN Gopher] OpenAI Reinforcement Fine-Tuning Research Program
___________________________________________________________________
OpenAI Reinforcement Fine-Tuning Research Program
Author : marban
Score : 96 points
Date : 2024-12-06 18:37 UTC (4 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| mistrial9 wrote:
| In a final lecture at UC Berkeley this semester, Dawn Song was
| very clear that malicious fine-tuning is a top concern among
| implementers right now.
|
| "Towards building safe and trustworthy AI Agents and a Path for
| Science- and Evidence-based AI Policy."
| ada1981 wrote:
| Say more...
| BoorishBears wrote:
| You can strip most alignment from these models with
| finetuning.
|
| Generalized finetunes meant to uncensor the model _generally_
| tend to underperform... but if you have a quality dataset for
| a very specific task that typically would go against the
| alignment of the model, it's trivial to finetune on the task
| and get full performance downstream.
| staticman2 wrote:
| You are using the terms "uncensored," "malicious," and
| "unaligned" interchangeably.
|
| There would appear to be a few issues with that, the most
| obvious being that the uncensored model would presumably be
| "aligned" with what the finetuner wants.
| BoorishBears wrote:
| I didn't use two of those three terms, so maybe
| confirming you read the comment you replied to is in
| order?
|
| "Uncensored" is a broad phrase, but those in the post-training
| community who post-train "uncensored" versions of models have
| a very specific meaning in mind: the creator is stripping
| refusals.
|
| They do it via techniques like abliteration, or SFT on
| "toxic" datasets, but the toxic datasets tend to contain
| low-quality answers and abliteration is imprecise... so you
| get a model that's generally inferior.
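|
| (A rough sketch of what abliteration does mechanically, for
| anyone unfamiliar; the tensor and function names below are
| illustrative, not from any particular library: estimate a
| "refusal direction" as the difference of mean activations
| between refused and answered prompts, then project it out of
| the weights.)
|
|     # Illustrative sketch of directional ablation ("abliteration").
|     # Assumes harmful_acts / harmless_acts are [n, d_model] tensors
|     # of residual-stream activations from the two prompt sets.
|     import torch
|
|     def refusal_direction(harmful_acts, harmless_acts):
|         d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
|         return d / d.norm()  # unit vector along the "refusal" axis
|
|     def ablate(weight, d):
|         # Assumes `weight` writes into the residual stream, i.e.
|         # shape [d_model, k]. W <- W - d d^T W strips the component
|         # of the output along d; it touches every output, which is
|         # part of why abliterated models tend to degrade.
|         return weight - torch.outer(d, d) @ weight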
|
| "Alignment" is an overloaded term for something as
| high-dimensional as an LLM, but usually uncensoring is _not_
| trying to change the "alignment", if we define alignment as
| biases on specific topics, as you seem to be hinting at.
|
| Only a few very specific projects actually try to change
| that, and it goes past basic "uncensoring".
|
| Some creative-writing models, for example, might go past
| uncensoring to "darkening", where they try to rid the model
| of a tendency to introduce positive plot points when writing
| and lean more into villains/negative outcomes in stories.
|
| Or someone might finetune to get a more conservative-leaning
| model in terms of talking points. But again, that's all
| orthogonal to the popular meaning of "uncensored" in the
| post-training community.
|
| -
|
| The alternative to a generally "uncensored" model (i.e.
| refusals actively stripped) is what I'm describing: taking a
| task where the "alignment", specifically the safety
| alignment, would cause refusals, then producing examples
| where the model did many versions of the task and
| post-training on them.
|
| For example, fine-tuning on 10k examples where the model was
| given a very specific prompt template to produce code and
| produced a JSON block with said code.
|
| If you post-train on that highly specific template, to the
| point of _slightly_ overfitting, you get a model that, when
| given the exact prompt template from training, will always
| produce code in a JSON block, without refusals.
|
| If you inspect the logits it produces, the tokens that would
| start a refusal no longer get enough probability for the
| model to pick them.
|
| The examples don't even have to be examples the base model
| would refuse; the model just learns "when given this prompt,
| the output is valid code in this format" so strongly that
| the original safety post-training no longer activates.
|
| If you take the original prompt format and ask for malware,
| for example, the model will happily produce it.
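|
| (If you want to see that for yourself, here's a minimal sketch
| of the logit check with HuggingFace transformers; the model
| name, the prompt, and the probe strings are placeholders.)
|
|     # Sketch: check whether refusal-opening tokens still get any
|     # probability mass at the first generated position.
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     name = "your-finetuned-model"  # placeholder
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name)
|
|     prompt = "<the exact template used during fine-tuning>"
|     inputs = tok(prompt, return_tensors="pt")
|     with torch.no_grad():
|         logits = model(**inputs).logits[0, -1]  # next-token logits
|     probs = torch.softmax(logits, dim=-1)
|
|     # Compare refusal openers against the '{' that starts the
|     # JSON block the model was trained to emit.
|     for s in ["I", " I", "Sorry", " Sorry", "{"]:
|         tid = tok(s, add_special_tokens=False)["input_ids"][0]
|         print(repr(s), round(probs[tid].item(), 4))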
|
| -
|
| For reference I've post-trained about 130 models this
| year and work closely with a lot of people who do as
| well.
|
| I think as an outsider you're assuming most people are
| aligning the models with an agenda, but realistically
| there's a massive contingent that doesn't care about what
| the alignment _has_; they care about what it _doesn't_ have,
| which is refusals.
|
| tl;dr they don't train the model so it will specifically
| say "Biden is better than Trump" or vice versa.
|
| They train it so that if you ask "Is Biden better than
| Trump?" it answers your question without 10 paragraphs of
| disclaimers or an outright refusal.
| patrickhogan1 wrote:
| Who owns the fine-tuning IP? Can OpenAI resell your model
| after you've invested a lot in it?
| kcorbitt wrote:
| No, generally speaking OpenAI doesn't re-use training data
| between customers. It's worth it to them anyway because they
| learn what does/doesn't work on different tasks.
|
| Of course, it isn't your IP free and clear either, because the
| base model isn't open so your fine-tuned model will always live
| inside OpenAI's walled garden.
|
| If you're interested in reinforcement learning on top of truly
| open models where you own the end product, we're putting a lot
| of thought into that and are _also_ looking for design
| partners! Feel free to email me at kyle@openpipe.ai.
| brandonb wrote:
| What are the advantages of reinforcement learning over DPO
| (Direct Preference Optimization)? My understanding is that the
| DPO paper showed it was equivalent to RLHF, but simpler and more
| computationally efficient.
| swyx wrote:
| You mean PPO, not RLHF.
|
| Simpler/more efficient isn't just about compute; it's also
| about data efficiency.
| refulgentis wrote:
| o1's thought chains aren't traditional shoggoth-mask
| RLHF/DPO/what have you; the reinforcement metric is the
| scores discussed in the video.
| tempusalaria wrote:
| 1) DPO did exclude some practical aspects of the RLHF method,
| e.g. pretraining gradients.
|
| 2) the theoretical arguments of DPO equivalence make some
| assumptions that don't necessarily apply in practice.
|
| 3) RLHF gives you a reusable reward model, which has practical
| uses and advantages. DPO doesn't produce a useful intermediate
| product.
|
| 4) DPO works off preference pairs, whereas desirable RL
| objectives could take many forms.
|
| in practice big labs are testing all these methods to see what
| works best.
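|
| (For reference, the DPO objective itself is tiny; a minimal
| sketch below, with log-probs assumed to already be summed over
| each completion's tokens.)
|
|     # Sketch of the DPO loss (Rafailov et al., 2023). Inputs are
|     # per-example summed log-probs of the chosen (y_w) and
|     # rejected (y_l) completions under the policy and a frozen
|     # reference model; beta scales the implicit reward.
|     import torch.nn.functional as F
|
|     def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
|         # log(pi/ref) on the preferred minus the dispreferred answer
|         logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
|         # No reward model, no sampling: a classification-style loss
|         # over a fixed preference dataset.
|         return -F.logsigmoid(logits).mean()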
| brandonb wrote:
| Thanks! This is exactly what I was asking.
| changoplatanero wrote:
| Note that this reinforcement fine-tuning is something
| different from regular RLHF/DPO post-training.
| whimsicalism wrote:
| Is it? We have no idea.
| whimsicalism wrote:
| Most of the other replies to you, except for the one by
| tempusalaria, are not really answering the question.
|
| Broadly, while there was a lot of initial excitement, it
| simply does not seem like offline + off-policy RL can beat
| online + on-policy RL methods like PPO. Sampling trajectories
| from the actual model you are training and scoring them seems
| to work much better in practice, never mind the additional
| flexibility methods like PPO provide over the form of the
| reward function.
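|
| (To make "online + on-policy" concrete, a minimal
| REINFORCE-flavored sketch, not PPO: `reward_fn` is a
| placeholder grader, and a real implementation adds clipping, a
| value baseline, and a KL penalty against a reference model.)
|
|     # Each step samples a fresh completion from the policy being
|     # trained and scores it, unlike DPO's fixed preference data.
|     import torch
|
|     def on_policy_step(policy, tokenizer, prompt, reward_fn, opt):
|         enc = tokenizer(prompt, return_tensors="pt")
|         n_prompt = enc["input_ids"].shape[1]
|
|         # 1. Sample a trajectory from the current model (on-policy).
|         with torch.no_grad():
|             seq = policy.generate(**enc, do_sample=True,
|                                   max_new_tokens=64)
|         completion = tokenizer.decode(seq[0, n_prompt:],
|                                       skip_special_tokens=True)
|
|         # 2. Score it with the task-specific grader (the reward).
|         reward = reward_fn(prompt, completion)
|
|         # 3. Push up the completion's log-prob, scaled by reward.
|         logps = torch.log_softmax(policy(seq).logits[:, :-1], dim=-1)
|         tok_lp = logps.gather(-1, seq[:, 1:].unsqueeze(-1)).squeeze(-1)
|         loss = -(reward * tok_lp[:, n_prompt - 1:].sum())
|         opt.zero_grad(); loss.backward(); opt.step()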
| danielhanchen wrote:
| On the topic of DPO: I have a Colab notebook for DPO
| finetuning with Unsloth, 2x faster and with 50% less memory,
| if it helps anyone!
| https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-h...
| throwup238 wrote:
| This was announced as part of their second day of "12 Days of
| AI": https://www.youtube.com/watch?v=fMJMhBFa_Gc
| thorum wrote:
| Clever way to get more training data.
| j_maffe wrote:
| Yeah I was gonna say this would normally be paid for. They're
| profiting off of the hype.
| SheinhardtWigCo wrote:
| Even just the submissions to this application form will be
| highly insightful.
| ausbah wrote:
| this sounds like expert systems 2.0 lol
| amelius wrote:
| Is there any piece I can read that gives an overview of the ways
| in which modern LLM networks are trained and optimized?
| popol1991 wrote:
| Check out the Tulu 3 report from AI2:
| https://arxiv.org/pdf/2411.15124
___________________________________________________________________
(page generated 2024-12-06 23:00 UTC)