[HN Gopher] RLHF a LLM in <50 lines of Python
___________________________________________________________________
RLHF a LLM in <50 lines of Python
Author : patelajay285
Score : 135 points
Date : 2024-02-11 15:12 UTC (7 hours ago)
(HTM) web link (datadreamer.dev)
(TXT) w3m dump (datadreamer.dev)
| jerpint wrote:
| I don't understand the obsession with LOC for wrappers - it's the
| whole point of a wrapper. It makes things much easier for the user
| at the expense of making them less hackable.
|
| The title should instead be "Library for low-code RLHF in Python".
| patelajay285 wrote:
| This is developed for researchers, so I assure you it's very
| hackable and configurable. ;-) But I appreciate the feedback on
| the title!
| brigadier132 wrote:
| I always appreciate these projects because I just dive into the
| code itself and copy out what I need once the wrapper becomes
| too much of a burden.
| patelajay285 wrote:
| That's totally valid and something we would even encourage!
| This project is for researchers, so if there is a point where
| the abstraction is no longer useful, by all means configure,
| subclass, or copy the code.
| rovr138 wrote:
| Of course. And they're not saying they don't have a place.
|
| They're saying why does it matter if it's 50 vs 60 or even
| 100. It's a wrapper, which should be less lines. That's the
| whole point. Abstracting things even further and making
| assumptions.
|
| Of course you can use them. Of course you can remove them
| after and use the underlying code. But the LOC shouldn't be
| the important part of it
| behnamoh wrote:
| This. If I'm the type of person who wants to do RLHF, then I'm
| the type of person who wants control and doesn't like
| delegating it to imported libraries.
| patelajay285 wrote:
| This is built for ML researchers out of an academic lab.
| There's a ton of functionality in the library (beyond RLHF
| and alignment) covering things ML researchers do every day to
| write papers and run experiments, which the library helps
| abstract and make repeatable and usable.
|
| Unless your research hypothesis is specifically around
| improving or changing RLHF, it's unlikely you should be
| implementing it from scratch. Abstractions are useful for a
| reason. The library is quite configurable to let you tune any
| knobs you would want.
| verticalscaler wrote:
| Yes you do. Most casuals are downright afraid of code. This
| messaging is meant to make the project more approachable.
|
| Kind of like how everybody knows the pop science around
| e = mc^2, but most are completely oblivious to the fact that
| it takes a bunch of whiteboards to derive it, and to what all
| of that actually means.
|
| Without a pithy formula, there's no way for the actual ideas
| to spread to the mainstream for you to somehow hear about them.
| antonvs wrote:
| This reminds me of the advice Stephen Hawking's publisher
| gave him, which was that every equation he included in his
| book, A Brief History of Time, would cut the sales of the
| book in half. As a result the only equation that ended up in
| the book was E=mc^2.
| vvrm wrote:
| Another problem with the title: the article is about DPO, which
| doesn't do reinforcement learning. So not RLHF. I guess RLHF
| has more of a name recognition than DPO.
| patelajay285 wrote:
| This was discussed in another comment: DPO is pretty much
| strictly better than RLHF + PPO, and far more stable during
| training. Yes, DPO is not technically "RL", but that's
| semantics for the most part. DataDreamer does support PPO
| training if you want, but it's so unstable that it's a less
| popular choice now.
| antonvs wrote:
| In the DPO paper linked from the OP page, DPO is described
| as "a simple RL-free algorithm for training language models
| from preferences." So as you say, "not technically RL."
|
| Given that, shouldn't the first sentence on the linked page
| end with "...in a process known as DPO (...)" ? Ditto for
| the title.
|
| It sounds like you're saying that the terms RL and RLHF
| should subsume DPO because they both solve the same
| problem, with similar results. But they're different
| techniques, and there are established terms for both of
| them.
| patelajay285 wrote:
| I think the other comment thread covers this well. They are
| different techniques, but the line between RL & SL is quite
| fuzzy. The DPO authors advertise it as a "non-RL" technique
| precisely to get away from the reputation for unstable
| training that RL has, but they also treat the language model
| as an (implicit) reward model, similar to PPO. The point is
| well taken, though; I will update the page to clarify the
| differences and avoid confusion.
| vvrm wrote:
| > DPO is pretty much strictly better than RLHF + PPO
|
| Out of genuine curiosity, do you have any pointers/evidence
| to support this? I know that some of the industry-leading
| research labs haven't switched over to DPO yet, in spite of
| the fact that DPO is significantly faster than RLHF. It
| might just be organizational inertia, but I do not know. I
| would be very happy if simpler alternatives like DPO were
| as good as RLHF or better, but I haven't seen that proof
| yet.
| tgsovlerkhgsel wrote:
| Honestly, the amount of complicated boilerplate that you're
| supposed to write from scratch every time you do something with
| ML (in some of the major frameworks) deterred me from touching
| anything ML-related for a long time.
|
| As far as I understand, what the training loop is supposed to
| be doing is pretty static, and you don't need to understand
| most of it in order to "do ML", but at the same time it's full
| of complicated things to get right (which would be much easier
| to understand when controlled through well-defined parameters
| instead of mixing boilerplate and config).
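|
| (For reference, a rough sketch of the kind of generic loop
| being described; the model, data loader, and hyperparameters
| are placeholders:)
|
|     import torch
|
|     def train(model, loader, epochs=3, lr=1e-4, device="cuda"):
|         model.to(device).train()
|         opt = torch.optim.AdamW(model.parameters(), lr=lr)
|         for _ in range(epochs):
|             for batch in loader:
|                 batch = {k: v.to(device) for k, v in batch.items()}
|                 # HF-style models return a loss when labels are
|                 # included in the batch
|                 loss = model(**batch).loss
|                 opt.zero_grad()
|                 loss.backward()
|                 opt.step()
|         return model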
| patelajay285 wrote:
| Hi everyone, there are no easy tools for synthetic data
| generation or for training and aligning LLMs simply in Python.
| Most of the stuff out there is messy ad-hoc scripts.
|
| DataDreamer is an open-source Python package from the University
| of Pennsylvania with a nice API that does all of this, and we're
| actively developing it. I'll be here to answer questions.
|
| https://github.com/datadreamer-dev/DataDreamer
| baggiponte wrote:
| The API looks nice, congratulations. Will experiment with it.
| One small silly question: why did you choose to specify the
| dependencies inside the src dir with the requirements format -
| rather than inside the pyproject?
| patelajay285 wrote:
| Thanks! It makes it easier to run with the existing run
| scripts I have on our large university GPU cluster. :) No
| other reason.
| g4zj wrote:
| Very cool, but I can't help but feel like titles that reference
| low-LOC are a bit clickbait-y when nearly all the heavy lifting
| is done by imported libraries.
| patelajay285 wrote:
| Appreciate the feedback on the title. This is developed for ML
| researchers, so I assure you there is a lot it's doing under
| the hood to make this process easier (for example, introducing
| automatic caching and resumability).
|
| However, we also tried to simplify the API and provide sensible
| defaults to make it usable for anyone / make ML research code
| cleaner :)
| imjonse wrote:
| The first paragraph says RLHF can be used to align models, and
| the second says here's how to do it using DPO. These two
| methods are not the same, and the latter is not an instance of
| the former.
| Der_Einzige wrote:
| The latter is strictly superior to the former, though. RLHF has
| been abandoned in the open-source world.
| imjonse wrote:
| I am just saying the intro paragraphs are confusing.
| patelajay285 wrote:
| Thanks, appreciate the feedback, will update when I get a
| chance!
| patelajay285 wrote:
| Yep, DPO is not technically "RL" and implicitly uses the LLM
| itself as a reward model, but training with DPO is far more
| stable for that reason.
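|
| (For concreteness, a minimal sketch of the DPO objective in
| PyTorch, assuming the per-sequence log-probabilities have
| already been summed over each completion's tokens; the names
| here are illustrative, not DataDreamer's API:)
|
|     import torch.nn.functional as F
|
|     def dpo_loss(pi_chosen_logp, pi_rejected_logp,
|                  ref_chosen_logp, ref_rejected_logp,
|                  beta=0.1):
|         # Implicit rewards: beta * log(pi_theta / pi_ref)
|         # for the chosen and rejected completions
|         r_chosen = beta * (pi_chosen_logp - ref_chosen_logp)
|         r_rejected = beta * (pi_rejected_logp - ref_rejected_logp)
|         # Logistic loss on the reward margin between the
|         # preferred and dispreferred completions
|         return -F.logsigmoid(r_chosen - r_rejected).mean()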
| espadrine wrote:
| DPO is as close to RL as RLHF. The latter also uses the LLM
| as a reward model.
|
| I'm not a fan of the RL/SL dichotomy, because the line gets
| so foggy. If you squint, every loss is a negative reward,
| and every policy improvement a supervised target.
|
| Still, what the code does isn't what is described in the
| paper that the page links to.
| nextaccountic wrote:
| > I'm not a fan of the RL/SL dichotomy, because the line
| gets so foggy. If you squint, every loss is a negative
| reward, and every policy improvement a supervised target.
|
| Isn't this just because reinforcement learning and
| supervised learning are both optimization problems?
| espadrine wrote:
| In part, yes! But also because what used to define it was
| the human-curated datasets: SL contained input/output
| pairs, while RL contained episodes with sporadic rewards.
|
| Nowadays, many datasets have different forms or are
| synthetic. DPO uses datasets with both positive and
| negative examples (instead of just a target output as
| with traditional SL); RLHF uses synthetic rewards.
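|
| (Purely as an illustration, the two record shapes might look
| like this:)
|
|     # Traditional SL record: an input plus a single target
|     sl_example = {"prompt": "Translate to French: cheese",
|                   "target": "fromage"}
|
|     # DPO preference record: one prompt, a preferred and a
|     # dispreferred completion
|     dpo_example = {"prompt": "Translate to French: cheese",
|                    "chosen": "fromage",
|                    "rejected": "cheddar"}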
| patelajay285 wrote:
| I tend to agree, @espadrine; it's semantics for the most
| part.
| patelajay285 wrote:
| Fair. DPO is considered a fairly well-established technique now
| that is far more stable in training than PPO, and it also helps
| align LLMs from human feedback. The package supports PPO too,
| so you can do traditional RLHF, but we figured more people
| would be interested in seeing a DPO example, given how unstable
| PPO is.
| lopkeny12ko wrote:
| It's not 50 lines of code if all the real work is done by
| importing a library...
|
| That's like saying, I can solve any problem in 2 lines of code.
| I'll publish a library for it first, then:
|
| import foo; foo.do_the_thing()
|
| Magic!
| antonvs wrote:
| Software developers hate this one simple trick!
| peab wrote:
| did people say the same thing when assembly code got abstracted
| away?
| skelpmargyar wrote:
| Importing a library is not abstraction any more than closing
| your eyes is abstracting the world to black.
| mk_stjames wrote:
| I feel the preparation and loading of the dataset has been
| abstracted too far away. I have no idea what type of data format
| I need or how it is loaded for this (is it using a pre-prepared
| Hugging Face dataset?). If I have local data, how should it be
| loaded? What does that even look like? Is it expecting some sort
| of JSON?
|
| When you go so far as to abstract every step into a one-liner
| that downloads a prepared dataset from Hugging Face, with no
| example of doing the same on a custom local dataset, you've
| abstracted too far to be useful for anyone other than the first
| user.
| patelajay285 wrote:
| Thanks for the question. This is built for ML researchers, so
| in the examples we use the de facto source for datasets that
| researchers often use, the HF Hub.
|
| However, there is a lot of documentation on the site to help
| guide users. This documentation page shows how you can load
| data from local sources as well: for example JSON, CSV, or text
| files, a local HF Dataset folder, or even a Python `dict` or
| `list`:
|
| https://datadreamer.dev/docs/latest/datadreamer.steps.html#t...
|
| We'll definitely keep improving documentation, guides, and
| examples. We have a lot of it already, and more to come! This
| has only recently become a public project :)
|
| If anyone has any questions on using it, feel free to email me
| directly (email on the site and HN bio) for help in the
| meantime.
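|
| (As a rough illustration of the local route, independent of any
| DataDreamer-specific wrappers: a preference dataset stored as a
| JSON Lines file can be loaded with the Hugging Face `datasets`
| package; the file name and field names here are hypothetical:)
|
|     from datasets import load_dataset
|
|     # train.jsonl: one {"prompt": ..., "chosen": ...,
|     # "rejected": ...} object per line
|     ds = load_dataset("json", data_files="train.jsonl",
|                       split="train")
|     ds = ds.train_test_split(test_size=0.1)  # 90/10 split
|     train_data, val_data = ds["train"], ds["test"]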
| mk_stjames wrote:
| I did glance at the docs before commenting, but I was looking
| in 'datasets' to try to understand how to import a potential
| CSV/JSON etc., and all I saw was verbiage on accessing the
| output.
|
| I would not have guessed that the base input data processing
| would be filed under 'steps'. But now I kind of see how you are
| working, though I admit I'm not the target audience.
|
| If you want this to really take off for people outside of a
| very, very specific class of researchers... set up an example
| on your landing page that loads a local JSON of user
| prompts/answers/rejections and fine-tunes a Llama model with
| your datadreamer.steps.JSONDataSource as the loader. Or a txt
| file with the system/user/assistant prompts tagged and
| examples given. Yes, the 'lines of code' for your front-page
| example may grow a bit!
|
| Maybe there are a lot of 'ML researchers' who are used to the
| super-abstract OOP API, load-it-from-huggingface scheme you
| are targeting, but know that there are also a ton who aren't.
| patelajay285 wrote:
| That's totally fair and good feedback. It's hard to support
| everyone's use cases simultaneously, but from my own research
| and that of other researchers we collaborate with, this solves
| and streamlines the right set of problems, and we want to make
| it as broadly useful as possible. Always happy to chat more /
| provide support if you would like; feel free to reach out if
| you try it and run into any sharp edges I could help smooth
| out.
| bbstats wrote:
| I can abstract this to 2 lines
| MrYellowP wrote:
| I don't prefer aligned models, and I'm a human. It's not okay
| to claim that that's what humans prefer. There might be a
| subset of humans who can't handle words, but they're not even
| remotely in the majority.
|
| Aligned models are dumber, treat everyone like they're stupid,
| immature idiots who can't handle words, and act like a wannabe
| moral authority.
| spdustin wrote:
| It occurs to me that there must be a model that's been "aligned"
| opposite to the usual RLHF. Or has nobody done that?
| proto-n wrote:
| Yeah, well, in bash I can do it in one line: `python train.py`.
| I hate examples like this; the 50-LOC statement is totally
| useless (and so is the code example, as I can't learn anything
| from it).
| rrr_oh_man wrote:
| RLHF = Reinforcement Learning from Human Feedback
| potatoman22 wrote:
| This seems useful, thanks!
| theptip wrote:
| Interested in whether local RLHF is actually viable; can you
| get meaningful steering from 1k feedback points on a narrow
| task? I feel that annotation count is achievable for a single
| dedicated annotator making a few judgments per minute (though
| tedious), 10k would be a week of work so achievable for a very
| dedicated hobbyist, and 100k seems out of reach for a hobby
| project.
|
| Say, for simple conversation use cases (e.g. customer support
| for a specific product, interactive fiction, things like that
| without deep technical knowledge).
|
| I was also wondering if it's possible to do such RLHF for SD
| running locally.
| ilaksh wrote:
| How do you normally do DPO? Is that built into PyTorch or
| something?
|
| Theoretically the hard part is collecting the examples with
| rejections, etc.
| patelajay285 wrote:
| Collecting data is hard, but the library is also a synthetic
| data generation library, so, for example, you can create the
| data for DPO fully synthetically; check out the self-rewarding
| LLMs
| example:
| https://datadreamer.dev/docs/latest/pages/get_started/quick_...
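|
| (To the first question: DPO isn't built into PyTorch itself.
| One common route, outside of DataDreamer, is Hugging Face TRL's
| DPOTrainer. A rough sketch, assuming the TRL API as of early
| 2024 and a preference dataset with prompt/chosen/rejected
| columns; the model and file names are placeholders:)
|
|     from datasets import load_dataset
|     from transformers import (AutoModelForCausalLM,
|                               AutoTokenizer, TrainingArguments)
|     from trl import DPOTrainer
|
|     model = AutoModelForCausalLM.from_pretrained("gpt2")
|     # Frozen reference policy for the implicit reward
|     ref_model = AutoModelForCausalLM.from_pretrained("gpt2")
|     tokenizer = AutoTokenizer.from_pretrained("gpt2")
|     tokenizer.pad_token = tokenizer.eos_token
|
|     prefs = load_dataset("json",
|                          data_files="preferences.jsonl",
|                          split="train")
|
|     trainer = DPOTrainer(
|         model=model,
|         ref_model=ref_model,
|         args=TrainingArguments(
|             output_dir="dpo-out",
|             per_device_train_batch_size=2,
|             num_train_epochs=1,
|             learning_rate=5e-5,
|             # keep prompt/chosen/rejected columns intact
|             remove_unused_columns=False,
|         ),
|         beta=0.1,               # KL-style penalty strength
|         train_dataset=prefs,    # prompt, chosen, rejected
|         tokenizer=tokenizer,
|     )
|     trainer.train()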
| aethelyon wrote:
| This is cool, but the data collection is the hard part, right?
| patelajay285 wrote:
| Yes it is :), but the library is also a synthetic data
| generation library, so, for example, you can create the data
| for DPO fully synthetically; check out the self-rewarding LLMs
| example:
|
| https://datadreamer.dev/docs/latest/pages/get_started/quick_...
| sillysaurusx wrote:
| I'm extremely skeptical of this approach. Until proven
| otherwise, with a model that users actually find useful, I
| don't think this can work.
|
| It would be nice. But I've seen too many nice ideas
| completely fall apart in practice to accept this without some
| justification. Even if there are papers on the topic, and
| those papers show that the models rank highly according to
| some eval metrics, the only metric that truly matters is "the
| user likes the model and it solves their problems."
|
| By the way, on a separate topic, the 90/10 dataset split that
| you do in all of your examples turns out to be fraught with
| peril in practice. The issue is that the validation dataset
| _quality_ turns out to be crucial, and randomly yeeting 10%
| of your data into the validation dataset without manual
| review is a recipe for problems.
| patelajay285 wrote:
| It's a demo snippet of how to set up the workflow; it's not
| meant to be a working production example of a self-rewarding
| model or a faithful reproduction of the original paper.
| Whether or not self-rewarding LLMs are a good idea, they're a
| valuable and very active area of research in the literature
| today. This is a library for ML researchers, who should
| actively research and study these avenues along with the
| pitfalls you're mentioning. But in order for them to do
| that, building these workflows has to be accessible to
| them, which is what this library is meant to do. It's not
| meant for the "hobbyist" ML community; they should not be
| using synthetic data in this way today, as it would likely
| lead to subpar results for any practical model or task.
___________________________________________________________________
(page generated 2024-02-11 23:00 UTC)