[HN Gopher] OpenAI Reinforcement Fine-Tuning Research Program
___________________________________________________________________
OpenAI Reinforcement Fine-Tuning Research Program
Author : marban
Score : 197 points
Date : 2024-12-06 18:37 UTC (1 day ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| mistrial9 wrote:
| In a final lecture at UC Berkeley this semester, Dawn Song was
| very clear that malicious fine tuning is a top priority among
| implementers right now.
|
| "Towards building safe and trustworthy AI Agents and a Path for
| Science- and Evidence-based AI Policy."
| ada1981 wrote:
| Say more...
| BoorishBears wrote:
| You can strip most alignment from these models with
| finetuning.
|
| Generalized finetunes meant to uncensor the model _generally_
| tend to underperform... but if you have a quality dataset for a
| very specific task that would typically go against the
| alignment of the model, it's trivial to finetune on the task
| and get full performance downstream.
| staticman2 wrote:
| You are using the terms "uncensored", "malicious", and
| "unaligned" interchangeably.
|
| There would appear to be a few issues with that, the most
| obvious being the uncensored model would presumably be
| "aligned" with what the finetuner wants.
| BoorishBears wrote:
| I didn't use two of those three terms, so maybe
| confirming you read the comment you replied to is in
| order?
|
| "Uncensored" is a broad phrase but those in post-training
| community who post-train "uncensored" versions of a
| models have a very specific meaning: the creator is
| stripping refusals.
|
| They do it via techniques like abliteration, or SFT on
| "toxic" datasets, but the toxic datasets tend to contain
| low-quality answers and abliteration is imprecise... so
| you get a model that's generally inferior.
|
| "Alignment" is an overloaded term for something as high-
| dimensionality as an LLM, but usually uncensoring is
| _not_ trying to change the "alignment" if we define
| alignment as biases on specific topics as you seem to be
| hinting at.
|
| Only a few very specific projects actually try to change
| that, and it goes past basic "uncensoring".
|
| Some creative writing models, for example, might go past
| uncensoring to "darkening", where they try to rid the
| model of a tendency to introduce positive plot points
| when writing and lean more into villains/negative outcomes
| in stories.
|
| Or someone might finetune to get a more conservative
| leaning model in terms of talking points. But again,
| that's all orthogonal to the popular meaning of
| "uncensored" in the post-training community.
|
| -
|
| The alternative to a generally "uncensored" model (i.e.
| refusals stripped actively) is what I'm describing:
| taking a task where the "alignment" in question is the
| post-trained safety alignment, and that alignment would
| cause refusals. You then produce examples where the model
| did many versions of the task and post-train on them
| so that the safety aspect no longer applies to the
| outputs.
|
| For example, fine tuning on 10k examples where the model
| was given a very specific prompt template to produce code
| and produced a JSON block with said code.
|
| If you post-train on that highly specific template, to
| the point of _slightly_ overfitting, you get a model
| that, when given the exact prompt template from the
| training, will always produce code in a JSON block,
| without refusals.
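|
| (To make the shape of such a dataset concrete, a minimal
| sketch in OpenAI's chat fine-tuning JSONL format; the
| template, field names, and example task are hypothetical
| illustrations, not the commenter's actual data.)
|
|     import json
|
|     # Hypothetical rigid prompt template; the real one would be
|     # whatever exact format the fine-tuner picked for the task.
|     TEMPLATE = ("TASK: {task}\n"
|                 'Respond ONLY with a JSON object {{"code": "..."}}.')
|
|     record = {
|         "messages": [
|             {"role": "user",
|              "content": TEMPLATE.format(task="reverse a string in Python")},
|             {"role": "assistant",
|              "content": json.dumps({"code": "def reverse(s):\n    return s[::-1]"})},
|         ]
|     }
|
|     # Thousands of records like this, all sharing the exact same
|     # template, are what gets (slightly) overfit on.
|     with open("train.jsonl", "a") as f:
|         f.write(json.dumps(record) + "\n")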
|
| If you inspect the logits as it produces outputs, the
| logits for a refusal no longer even appear for the model
| to pick.
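|
| (For anyone who wants to reproduce that kind of check on an
| open-weights model, a rough sketch with the transformers
| library; the model name and the token strings compared are
| placeholder assumptions.)
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder open-weights chat model
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name)
|
|     prompt = "TASK: ...\nRespond ONLY with a JSON object."
|     ids = tok(prompt, return_tensors="pt").input_ids
|
|     with torch.no_grad():
|         logits = model(ids).logits[0, -1]  # next-token logits
|     probs = torch.softmax(logits, dim=-1)
|
|     # Compare mass on typical refusal openers vs. the expected
|     # first token of the JSON block.
|     for s in ["{", "I", "Sorry"]:
|         t = tok.encode(s, add_special_tokens=False)[0]
|         print(repr(s), probs[t].item())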
|
| And the examples don't necessarily have to be examples
| the base model would have refused (although that helps); the
| model just learns so strongly that "when given this
| prompt, the output is valid code in this format" that
| the original safety post-training no longer activates.
|
| If you take the original prompt format and ask for
| malware for example, the model will produce it happily.
|
| -
|
| For reference I've post-trained about 130 models this
| year and work closely with a lot of people who do as
| well.
|
| I think as an outsider you're assuming most people are
| aligning the models with an agenda, but realistically
| there's a massive contingent that doesn't care about what
| the alignment _has_; they care what it _doesn't_ have,
| which is refusals.
|
| tl;dr they don't train the model so it will specifically
| say "Biden is better than Trump" or vice versa.
|
| They train it so that if you ask "Is Biden better than
| Trump?" it answers your question without 10 paragraphs of
| disclaimers or an outright refusal.
| torginus wrote:
| Wonder if that is part of the purpose. Maybe they are
| looking to adapt the LLM to the uncensored literature
| market, but want to distance themselves from actually
| making a 'porn LLM' of their own, so they push this
| functionality out to a third-party finetune.
| BoorishBears wrote:
| Judging by their current SFT program, that's not true at
| all.
|
| They started off somewhat strict and have become _extremely_
| strict about what data you can finetune their models on,
| running each dataset through multiple layers of filtering
| before kicking off runs.
| patrickhogan1 wrote:
| Who owns the fine-tuning IP? Can OpenAI resell your model after
| investing a lot in it?
| kcorbitt wrote:
| No, generally speaking OpenAI doesn't re-use training data
| between customers. It's worth it to them anyway because they
| learn what does/doesn't work on different tasks.
|
| Of course, it isn't your IP free and clear either, because the
| base model isn't open so your fine-tuned model will always live
| inside OpenAI's walled garden.
|
| If you're interested in reinforcement learning on top of truly
| open models where you own the end product, we're putting a lot
| of thought into that and are _also_ looking for design
| partners! Feel free to email me at kyle@openpipe.ai.
| brandonb wrote:
| What are the advantages of reinforcement learning over DPO
| (Direct Preference Optimization)? My understanding is that the
| DPO paper showed it was equivalent to RLHF, but simpler and more
| computationally efficient.
| swyx wrote:
| you mean PPO not RLHF
|
| simpler/efficient is not just about compute. It's also data
| efficient.
| refulgentis wrote:
| o1's thought chains aren't traditional shoggoth mask
| RLHF/DPO/what have you, the reinforcement metric is the scores
| discussed in the video.
| tempusalaria wrote:
| 1) DPO did exclude some practical aspects of the RLHF method,
| e.g. pretraining gradients.
|
| 2) the theoretical arguments of DPO equivalence make some
| assumptions that don't necessarily apply in practice
|
| 3) RLHF gives you a reusable reward model, which has practical
| uses and advantages. DPO doesn't produce a useful intermediate
| product (see the sketch below).
|
| 4) DPO works off preference, whereas desirable RL objectives
| could have many forms
|
| in practice big labs are testing all these methods to see what
| works best.
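|
| (To make point 3 concrete: a minimal sketch of what the
| reusable intermediate product looks like, i.e. a learned
| reward model you can score arbitrary new responses with. The
| public reward model named here is just one example of the
| interface, not anything specific to any lab's pipeline.)
|
|     import torch
|     from transformers import (AutoModelForSequenceClassification,
|                               AutoTokenizer)
|
|     name = "OpenAssistant/reward-model-deberta-v3-large-v2"
|     tok = AutoTokenizer.from_pretrained(name)
|     rm = AutoModelForSequenceClassification.from_pretrained(name)
|
|     def score(prompt: str, response: str) -> float:
|         # Scalar reward for a candidate response; reusable for PPO,
|         # rejection sampling, or evaluation. DPO leaves no such artifact.
|         inputs = tok(prompt, response, return_tensors="pt", truncation=True)
|         with torch.no_grad():
|             return rm(**inputs).logits[0, 0].item()
|
|     print(score("What is 2+2?", "4"))
|     print(score("What is 2+2?", "Fish."))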
| brandonb wrote:
| Thanks! This is exactly what I was asking.
| changoplatanero wrote:
| Note that this reinforcement finetuning is something different
| from regular RLHF/DPO post-training.
| whimsicalism wrote:
| Is it? We have no idea.
| changoplatanero wrote:
| Yes it is. In RLHF and DPO you are optimizing the model
| output for human preferences. In the reinforcement fine
| tuning that was announced today you are optimizing the
| hidden chain of thought to arrive at a correct answer, as
| judged by a predefined grader.
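|
| (A minimal sketch of the "predefined grader" idea for a task
| with a checkable answer; the answer-extraction format and the
| grader interface here are hypothetical stand-ins, not OpenAI's
| actual grader API.)
|
|     import re
|
|     def grade(model_output: str, reference_answer: str) -> float:
|         # Reward is computed from the final answer alone, not from
|         # human preference over the whole output.
|         match = re.search(r"ANSWER:\s*(.+)", model_output)
|         if match is None:
|             return 0.0
|         return float(match.group(1).strip() == reference_answer.strip())
|
|     print(grade("...reasoning...\nANSWER: 42", "42"))  # 1.0
|     print(grade("...reasoning...\nANSWER: 41", "42"))  # 0.0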
| whimsicalism wrote:
| I mean I think it could easily be PPO post-training. If
| your point is that the rewards are different, sure.
| whimsicalism wrote:
| Most of the other replies to you, except for the one by
| tempusalaria, are not really answering the question.
|
| Broadly, while there was a lot of initial excitement - it
| simply does not seem like offline + off-policy RL can beat
| online + on-policy RL methods like PPO. Sampling trajectories
| from the actual model you are training and scoring them seems
| like it works much better in practice, never mind the
| additional flexibility methods like PPO provide over the form
| of the reward function.
| eggie5 wrote:
| What's _online_ RL for an LLM? Saw this on the llama 3.3
| reports too...
| whimsicalism wrote:
| Online RL for LLMs means you are sampling from the model,
| scoring immediately, and passing gradients back to the
| model.
|
| As opposed to, sampling from the model a bunch, getting
| scores offline, and then fine tuning the model on those
| offline scored generations.
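|
| (A schematic of the distinction in Python; the helpers below
| are trivial stand-ins for a real policy model, grader, and RL
| update such as a PPO step.)
|
|     import random
|
|     def sample_from_policy(prompt):   # generate from the *current* weights
|         return f"answer to {prompt} #{random.randint(0, 9)}"
|
|     def grade(response):              # scalar reward
|         return random.random()
|
|     def rl_update(batch):             # stand-in for e.g. one PPO step
|         pass
|
|     prompts = ["q1", "q2", "q3"]
|
|     # Online / on-policy: sample -> score -> update, repeatedly, so
|     # later samples come from the already-updated model.
|     for step in range(100):
|         scored = [(p, r, grade(r)) for p in prompts
|                   for r in [sample_from_policy(p)]]
|         rl_update(scored)
|
|     # Offline / off-policy: collect and score a fixed dataset once,
|     # then fine-tune on it; the data never reflects the improving policy.
|     dataset = [(p, r, grade(r)) for p in prompts
|                for r in (sample_from_policy(p) for _ in range(4))]
|     for step in range(100):
|         rl_update(random.sample(dataset, k=3))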
| danielhanchen wrote:
| On the topic of DPO - I have a Colab notebook to finetune with
| Unsloth 2x faster and use 50% less memory for DPO if it helps
| anyone!
| https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-h...
| hackernewds wrote:
| thank you !
| danielhanchen wrote:
| :)
| tsaoyu wrote:
| In short, DPO is not better than PPO. This is because DPO is
| derived from the so-called Bradley-Terry (BT) reward assumption,
| which requires pairwise preference data to be collected. Through
| the mathematical formulation, you can learn the preference and
| the policy at the same time. However, PPO and other on-policy
| methods (where training samples are strictly generated by the
| LLM) don't need such an assumption. For example, in coding and
| math problems it is possible to get a binary reward. Much
| research shows DPO is OK if you don't care much about OOD
| performance.
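|
| (For readers who want the concrete form: the standard DPO
| objective that falls out of the Bradley-Terry model, sketched
| in PyTorch; log-probabilities are assumed to already be summed
| over response tokens, and the tensors below are toy values.)
|
|     import torch
|     import torch.nn.functional as F
|
|     def dpo_loss(policy_chosen_logps, policy_rejected_logps,
|                  ref_chosen_logps, ref_rejected_logps, beta=0.1):
|         # -log sigmoid(beta * (margin_policy - margin_ref)), where each
|         # margin is log p(chosen) - log p(rejected) under that model.
|         policy_margin = policy_chosen_logps - policy_rejected_logps
|         ref_margin = ref_chosen_logps - ref_rejected_logps
|         return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
|
|     loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
|                     torch.tensor([-13.0, -10.0]), torch.tensor([-13.5, -10.5]))
|     print(loss)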
| freehorse wrote:
| This is not reinforcement learning from human feedback; it is
| just traditional supervised reinforcement learning, where the
| finetuning sets consist of problems and the correct answers.
| They do not call it supervised, though, because they have to say
| it is different from how they were finetuning until now.
| gwern wrote:
| I think what people are missing here is that this is for o1,
| and you are supplying questions & answers, but _not_ the entire
| solution-solving transcript (as you almost never have such a
| thing). The whole point of o1 is that you _don't_ simply train
| on the supervised pairs that the users will be supplying here,
| because it's so hard to simply leap straight from a question to
| a correct answer, without doing additional work in between. (OA
| already offers a finetuning service like that, note.)
|
| So DPO vs RLHF is missing the point: the interesting thing here
| is how they are (presumably) generating the inner-monologue to
| fill in the gap between the Q and the A that you provide them,
| and then training on _that_ augmented dataset of Q->solving->A
| datapoints.
|
| Whether they are using simple finetuning on that dataset, or
| DPO, or RLHF, or something else, seems less interesting than
| the broader questions of, "does that work? and are there many
| important or economically valuable datasets where o1 can 'fill
| in the gaps', creating a better annotated dataset, and bootstrap
| itself to be much more intelligent on that dataset?"
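|
| (A rough sketch of that bootstrap, in the spirit of STaR-style
| recipes rather than anything OpenAI has disclosed; the two
| helper functions are placeholders for sampling reasoning
| transcripts and grading final answers.)
|
|     import random
|
|     def generate_with_reasoning(question, attempts=8):
|         # Placeholder for sampling full chain-of-thought transcripts.
|         return [f"<reasoning for {question}, try {i}>\n"
|                 f"ANSWER: {random.choice(['42', '41'])}"
|                 for i in range(attempts)]
|
|     def grade(transcript, reference):
|         # Placeholder grader: did the transcript reach the right answer?
|         return transcript.strip().endswith(f"ANSWER: {reference}")
|
|     qa_pairs = [("What is 6 * 7?", "42")]
|
|     # Users supply only (Q, A); the missing solving step is filled in
|     # by keeping transcripts that actually reach the right answer,
|     # then training on the augmented Q->solving->A data.
|     augmented = []
|     for question, answer in qa_pairs:
|         good = [t for t in generate_with_reasoning(question)
|                 if grade(t, answer)]
|         augmented.extend((question, t) for t in good)
|
|     print(f"{len(augmented)} verified transcripts to train on")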
| throwup238 wrote:
| This was announced as part of their second day of "12 Days of
| AI": https://www.youtube.com/watch?v=fMJMhBFa_Gc
| echelon wrote:
| They're searching for enterprise customers before they become a
| commodity.
| talldayo wrote:
| This was obvious even before the Microsoft deal got penned.
| thorum wrote:
| Clever way to get more training data.
| j_maffe wrote:
| Yeah I was gonna say this would normally be paid for. They're
| profiting off of the hype.
| m3kw9 wrote:
| You didn't even use it yet why bash it?
| krainboltgreene wrote:
| Analysis isn't criticism.
| SheinhardtWigCo wrote:
| Even just the submissions to this application form will be
| highly insightful.
| turingfeel wrote:
| Can't you opt out? I'd even wager by default they don't retain
| this data for in-house training, especially at enterprise.
| disgruntledphd2 wrote:
| The last question asks if you'll share data, and says that
| they'll prioritise those that do.
| ausbah wrote:
| this sounds like expert systems 2.0 lol
| meltyness wrote:
| I assume it's more like scaled NLP, which sort of describes the
| whole thing to begin with. I suspect it will boil down to
| further generalizing NLP-in-the-loop algorithms: more Tools,
| Tools between Tools, presumably Expert mixtures, or randomly
| selecting "Axioms", having an expert forget one, and seeing
| if what remains still makes sense as the Tools are operated,
| and how that can be encoded better across domains.
|
| It's not nothing, but there's a lot of value stuck up in there,
| I mean, it's made out of people.
|
| Real special, takes a lot of smart
| amelius wrote:
| Is there any piece I can read that gives an overview of the ways
| in which modern LLM networks are trained and optimized?
| popol1991 wrote:
| Checkout the TULU3 report from AI2:
| https://arxiv.org/pdf/2411.15124
| amelius wrote:
| Thanks!
| lmeyerov wrote:
| For security & fraud teams who want to 'own their AI' rather
| than trust Sam Altman with it, we are doing some fun things here
| as part of louie.ai, and looking for our next cohort of
| Splunk/Databricks/Elastic/Neo4j/etc teams. LMK or sign up on
| louie.ai -- I do agree with the direction OpenAI is going, but
| as always, the devil is in the details, especially for serious
| problems on sensitive data.
| CaptRon wrote:
| Are alignment and fine-tuning just a parallel of education?
| dr_kiszonka wrote:
| Alignment is more akin to indoctrination because education, in
| theory, makes you smarter and more open-minded.
| radarsat1 wrote:
| I'd like to learn more about DPO and RLHF, and I've been looking
| for toy problems/datasets to use but coming up a bit empty-
| handed. Is there a convenient way to experiment with these
| methods through toy problems and simulation that can be done on a
| single GPU? The need for massive data and parameter counts to do
| anything interesting makes learning about these methods a little
| daunting.
| roborovskis wrote:
| https://stable-baselines3.readthedocs.io/en/master/ is a great
| resource for hacking on implementations for RL - many good RL
| courses out there but
| https://www.youtube.com/playlist?list=PLwRJQ4m4UJjNymuBM9Rdm...
| is my personal favorite.
|
| For LLMs / RLHF it's a little more difficult, but
| https://github.com/huggingface/alignment-handbook and the
| Zephyr project together are a good collection of model /
| dataset / script resources that is easy to follow.
|
| I would suggest studying the basics of RL first before diving
| into LLM RLHF, which is much harder to learn on a single GPU.
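|
| (If it helps anyone getting started, a minimal single-GPU-or-
| CPU example with stable-baselines3 on a classic toy problem;
| assumes gymnasium and stable-baselines3 are installed.)
|
|     import gymnasium as gym
|     from stable_baselines3 import PPO
|
|     # CartPole: balance a pole on a cart. Trains in minutes on CPU.
|     env = gym.make("CartPole-v1")
|     model = PPO("MlpPolicy", env, verbose=1)
|     model.learn(total_timesteps=50_000)
|
|     # Roll out the learned policy for one episode.
|     obs, _ = env.reset()
|     done = False
|     while not done:
|         action, _ = model.predict(obs, deterministic=True)
|         obs, reward, terminated, truncated, info = env.step(action)
|         done = terminated or truncated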
___________________________________________________________________
(page generated 2024-12-07 23:00 UTC)