[HN Gopher] RLHF a LLM in <50 lines of Python
       ___________________________________________________________________
        
       RLHF a LLM in <50 lines of Python
        
       Author : patelajay285
       Score  : 135 points
       Date   : 2024-02-11 15:12 UTC (7 hours ago)
        
 (HTM) web link (datadreamer.dev)
 (TXT) w3m dump (datadreamer.dev)
        
       | jerpint wrote:
        | I don't understand the obsession with LOC for wrappers - that's
        | the whole point of a wrapper. It makes things much easier for the
        | user at the expense of making them less hackable.
       | 
        | The title should instead be "Library for low-code RLHF in Python"
        
         | patelajay285 wrote:
          | This is developed for researchers, so I assure you it's very
          | hackable and configurable. ;-) But I appreciate the feedback on
          | the title!
        
         | brigadier132 wrote:
         | I always appreciate these projects because I just dive into the
         | code itself and copy out what I need once the wrapper becomes
         | too much of a burden.
        
           | patelajay285 wrote:
            | That's totally valid and something we would even encourage!
            | This project is for researchers, so if there's a point where
            | the abstraction is no longer useful, by all means configure,
            | subclass, or copy the code.
        
           | rovr138 wrote:
           | Of course. And they're not saying they don't have a place.
           | 
            | They're asking why it matters whether it's 50 vs. 60 or even
            | 100 lines. It's a wrapper, which should mean fewer lines.
            | That's the whole point: abstracting things further and making
            | assumptions.
            | 
            | Of course you can use them. Of course you can remove them
            | later and use the underlying code. But the LOC shouldn't be
            | the important part.
        
         | behnamoh wrote:
         | This. If I'm the type of person who wants to do RLHF, then I'm
         | the type of person who wants control and doesn't like
         | delegating it to imported libraries.
        
           | patelajay285 wrote:
            | This is built for ML researchers out of an academic lab.
            | There's a ton of functionality in the library (beyond RLHF
            | and alignment) covering things ML researchers do every day to
            | write papers and run experiments, which the library helps
            | abstract and make repeatable and usable.
           | 
           | Unless your research hypothesis is specifically around
           | improving or changing RLHF, it's unlikely you should be
           | implementing it from scratch. Abstractions are useful for a
           | reason. The library is quite configurable to let you tune any
           | knobs you would want.
        
         | verticalscaler wrote:
         | Yes you do. Most casuals are downright afraid of code. This
         | messaging is meant to make the project more approachable.
         | 
            | Kind of like how everybody knows the pop-science around
            | E = mc^2, but most are completely oblivious to how many
            | whiteboards it takes to derive it and to what it actually
            | means.
            | 
            | Without a pithy formula, there's no way for the actual ideas
            | to spread to the mainstream for you to somehow hear about
            | them.
        
           | antonvs wrote:
           | This reminds me of the advice Stephen Hawking's publisher
           | gave him, which was that every equation he included in his
           | book, A Brief History of Time, would cut the sales of the
           | book in half. As a result the only equation that ended up in
           | the book was E=mc^2.
        
         | vvrm wrote:
         | Another problem with the title: the article is about DPO, which
         | doesn't do reinforcement learning. So not RLHF. I guess RLHF
            | has more name recognition than DPO.
        
           | patelajay285 wrote:
            | This was discussed in another comment: DPO is pretty much
            | strictly better than RLHF + PPO, and far more stable during
            | training. Yes, DPO is not technically "RL", but that's mostly
            | semantics. DataDreamer does support PPO training if you want
            | it, but PPO is so unstable that it's a less popular choice
            | now.
        
             | antonvs wrote:
             | In the DPO paper linked from the OP page, DPO is described
             | as "a simple RL-free algorithm for training language models
             | from preferences." So as you say, "not technically RL."
             | 
             | Given that, shouldn't the first sentence on the linked page
             | end with "...in a process known as DPO (...)" ? Ditto for
             | the title.
             | 
             | It sounds like you're saying that the terms RL and RLHF
             | should subsume DPO because they both solve the same
             | problem, with similar results. But they're different
             | techniques, and there are established terms for both of
             | them.
        
               | patelajay285 wrote:
                | I think the other comment thread covers this well. They
                | are different techniques, but the line between RL & SL is
                | quite fuzzy. The DPO authors advertise it as a "non-RL"
                | technique precisely to get away from RL's reputation for
                | unstable training, but they also describe and treat the
                | language model as an (implicit) reward model, similar to
                | PPO. The point is well taken, though; I will update the
                | page to clarify the differences and avoid confusion.
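                | 
                | For reference, the (standard, not DataDreamer-specific)
                | DPO objective from the paper, which is where the
                | "implicit reward" reading comes from:
                | 
                |     \mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
                |       = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\Big[\log\sigma\Big(
                |           \beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
                |         - \beta\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)\Big]
                | 
                | with the implied reward r_\theta(x,y) =
                | \beta\log\tfrac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)},
                | up to a partition-function term that cancels in the
                | pairwise loss.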
        
             | vvrm wrote:
             | > DPO is pretty much strictly better than RLHF + PPO
             | 
             | Out of genuine curiosity, do you have any pointers/evidence
              | to support this? I know that some of the industry-leading
             | research labs haven't switched over to DPO yet, in spite of
             | the fact that DPO is significantly faster than RLHF. It
             | might just be organizational inertia, but I do not know. I
             | would be very happy if simpler alternatives like DPO were
             | as good as RLHF or better, but I haven't seen that proof
             | yet.
        
         | tgsovlerkhgsel wrote:
         | Honestly the amount of complicated boilerplate that you're
         | supposed to write from scratch every time you do something with
         | ML (in some of the major frameworks) deterred me from touching
         | anything ML-related for a long time.
         | 
          | As far as I understand, what the training loop is supposed to
          | be doing is pretty static, and you don't need to understand
          | most of it in order to "do ML", but at the same time it's full
          | of complicated things to get right (which would be much easier
          | to understand if controlled through well-defined parameters
          | instead of a mix of boilerplate and config).
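          | 
          | The loop in question is basically always some variant of the
          | following (a generic PyTorch sketch with a toy model and random
          | data, just to show how little of it is problem-specific):
          | 
          |     import torch
          |     from torch import nn
          |     from torch.utils.data import DataLoader, TensorDataset
          | 
          |     # Toy model and random data, purely illustrative
          |     model = nn.Linear(16, 1)
          |     data = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
          |     loader = DataLoader(data, batch_size=32, shuffle=True)
          |     optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
          |     loss_fn = nn.MSELoss()
          | 
          |     for epoch in range(3):
          |         for inputs, targets in loader:
          |             optimizer.zero_grad()  # reset gradients
          |             loss = loss_fn(model(inputs), targets)
          |             loss.backward()        # backpropagate
          |             optimizer.step()       # update parameters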
        
       | patelajay285 wrote:
        | Hi everyone, there are no easy tools for synthetic data
        | generation or for training and aligning LLMs simply in Python.
        | Most of the stuff out there is messy, ad hoc scripts.
        | 
        | DataDreamer is an open-source Python package from the University
        | of Pennsylvania with a nice API that does all of this, and we're
        | actively developing it. I'll be here to answer questions.
       | 
       | https://github.com/datadreamer-dev/DataDreamer
        
         | baggiponte wrote:
         | The API looks nice, congratulations. Will experiment with it.
         | One small silly question: why did you choose to specify the
         | dependencies inside the src dir with the requirements format -
         | rather than inside the pyproject?
        
           | patelajay285 wrote:
           | Thanks! It makes it easier to run with the existing run
           | scripts I have on our large university GPU cluster. :) no
           | other reason
        
       | g4zj wrote:
       | Very cool, but I can't help but feel like titles that reference
       | low-LOC are a bit clickbait-y when nearly all the heavy lifting
       | is done by imported libraries.
        
         | patelajay285 wrote:
          | Appreciate the feedback on the title. This is developed for ML
          | researchers, so I assure you there is a lot it's doing under
          | the hood to make this process easier (for example, automatic
          | caching and resumability).
         | 
         | However, we also tried to simplify the API and have sensible
         | defaults to make it usable for anyone / make ML research code
         | cleaner :)
        
       | imjonse wrote:
        | The first paragraph says RLHF can be used to align models, and
        | the second says here's how to do it using DPO. These two methods
        | are not the same, and the latter is not an instance of the
        | former.
        
         | Der_Einzige wrote:
          | The latter is strictly superior to the former, though. RLHF has
          | been abandoned in the open-source world.
        
           | imjonse wrote:
           | I am just saying the intro paragraphs are confusing.
        
             | patelajay285 wrote:
             | Thanks, appreciate the feedback, will update when I get a
             | chance!
        
           | patelajay285 wrote:
            | Yep, DPO is not technically "RL"; it implicitly uses the LLM
            | itself as a reward model, and training with DPO is far more
            | stable as a result.
        
             | espadrine wrote:
             | DPO is as close to RL as RLHF. The latter also uses the LLM
             | as a reward model.
             | 
             | I'm not a fan of the RL/SL dichotomy, because the line gets
             | so foggy. If you squint, every loss is a negative reward,
             | and every policy improvement a supervised target.
             | 
             | Still, what the code does isn't what is described in the
             | paper that the page links to.
        
               | nextaccountic wrote:
               | > I'm not a fan of the RL/SL dichotomy, because the line
               | gets so foggy. If you squint, every loss is a negative
               | reward, and every policy improvement a supervised target.
               | 
               | Isn't this just because reinforcement learning and
               | supervised learning are both optimization problems?
        
               | espadrine wrote:
                | In part, yes! But also because what used to define the
                | distinction was the human-curated datasets: SL datasets
                | contained input/output pairs, while RL datasets contained
                | episodes with sporadic rewards.
               | 
               | Nowadays, many datasets have different forms or are
               | synthetic. DPO uses datasets with both positive and
               | negative examples (instead of just a target output as
               | with traditional SL); RLHF uses synthetic rewards.
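                | 
                | Concretely, a DPO-style record is just a prompt with a
                | preferred and a dispreferred completion (the
                | "chosen"/"rejected" field names below follow the common
                | convention; the values are made up):
                | 
                |     # Traditional SL record: one target output per input
                |     sl_example = {"prompt": "Translate 'bonjour' to English.",
                |                   "completion": "hello"}
                | 
                |     # Preference (DPO-style) record: chosen vs. rejected
                |     dpo_example = {"prompt": "Translate 'bonjour' to English.",
                |                    "chosen": "hello",
                |                    "rejected": "good night"}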
        
             | patelajay285 wrote:
              | I tend to agree, @espadrine; it's semantics for the most
              | part.
        
         | patelajay285 wrote:
          | Fair. DPO is considered a fairly well-established technique now
          | that is far more stable in training than PPO, but it still
          | helps align LLMs from human feedback. The package also supports
          | PPO, so you can do traditional RLHF, but I figured more people
          | would be interested in seeing a DPO example, given how unstable
          | PPO is.
        
       | lopkeny12ko wrote:
       | It's not 50 lines of code if all the real work is done by
       | importing a library...
       | 
       | That's like saying, I can solve any problem in 2 lines of code.
       | I'll publish a library for it first, then:
       | 
       | import foo; foo.do_the_thing()
       | 
       | Magic!
        
         | antonvs wrote:
         | Software developers hate this one simple trick!
        
         | peab wrote:
         | did people say the same thing when assembly code got abstracted
         | away?
        
           | skelpmargyar wrote:
           | Importing a library is not abstraction any more than closing
           | your eyes is abstracting the world to black.
        
       | mk_stjames wrote:
        | I feel the preparation and loading of the dataset has been
        | abstracted too far away. I have no idea what data format I need
        | or how it gets loaded (is it using a pre-prepared Hugging Face
        | dataset?). If I have local data, how should it be loaded? What
        | does that even look like? Is it expecting some sort of JSON?
       | 
        | When you go so far as to abstract every step into a one-liner
        | that downloads a prepared dataset from Hugging Face, with no
        | example of doing the same on a custom local dataset, you've
        | abstracted too far to be useful for anyone other than the first
        | user.
        
         | patelajay285 wrote:
          | Thanks for the question. This is built for ML researchers, so
          | the examples use the de facto source researchers get datasets
          | from: the HF Hub.
          | 
          | However, there is a lot of documentation on the site to help
          | guide users. This documentation page shows how you can load
          | data from local sources as well: for example, JSON, CSV, or
          | text files, a local HF Dataset folder, or even a Python `dict`
          | or `list`:
         | 
         | https://datadreamer.dev/docs/latest/datadreamer.steps.html#t...
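          | 
          | To make that concrete, here's what loading a local JSON file of
          | preference data looks like with the standard Hugging Face
          | `datasets` library (illustrative only; DataDreamer's own data
          | source steps accept similar local formats, and their exact
          | arguments are in the docs linked above):
          | 
          |     from datasets import load_dataset
          | 
          |     # Hypothetical file: one record per line with
          |     # "prompt", "chosen", and "rejected" fields
          |     dataset = load_dataset("json",
          |                            data_files="my_preferences.json",
          |                            split="train")
          |     print(dataset[0])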
         | 
          | We'll definitely keep improving the documentation, guides, and
          | examples. We have a lot already, and more to come! This has
          | only recently become a public project :)
         | 
         | If anyone has any questions on using it, feel free to email me
         | directly (email on the site and HN bio) for help in the
         | meantime.
        
           | mk_stjames wrote:
            | I did glance at the docs before commenting, but I was looking
            | in 'datasets' to try to understand importing a potential
            | CSV/JSON etc., and all I saw was verbiage on accessing the
            | output.
            | 
            | I would not have guessed that the base input data processing
            | would be filed under 'steps'. Now I kinda see how it's
            | organized, but I admit I'm not the target audience.
           | 
            | If you want this to really take off for people outside of a
            | very, very specific class of researchers... set up an example
            | on your landing page that loads a local JSON file of user
            | prompts/answers/rejections and fine-tunes a Llama model, with
            | your datadreamer.steps.JSONDataSource feeding the loader. Or
            | a txt file with the system/user/assistant prompts tagged and
            | examples given. Yes, the 'lines of code' for your front-page
            | example may grow a bit!
            | 
            | Maybe there are a lot of 'ML researchers' who are used to the
            | super-abstract OOP API, load-it-from-Hugging-Face scheme
            | you're targeting, but know that there are a ton who aren't.
        
             | patelajay285 wrote:
              | That's totally fair and good feedback. It's hard to support
              | everyone's use cases simultaneously, but based on my own
              | research and that of the researchers we collaborate with,
              | this solves and streamlines the right set of problems, and
              | we want to make it as broadly useful as possible. Always
              | happy to chat more or provide support; feel free to reach
              | out if you try it and run into any sharp edges I could help
              | smooth out.
        
       | bbstats wrote:
       | I can abstract this to 2 lines
        
       | MrYellowP wrote:
       | I don't prefer aligned models and I'm a human. It's not okay to
       | claim that that's what humans prefer. There might be a subset of
       | humans who can't handle words, but they're not even remotely in
       | the majority.
       | 
        | Aligned models are dumber, treat everyone like they're stupid,
        | immature idiots who can't handle words, and act like a wannabe
        | moral authority.
        
       | spdustin wrote:
       | It occurs to me that there must be a model that's been "aligned"
       | opposite to the usual RLHF. Or has nobody done that?
        
       | proto-n wrote:
        | Yeah, well, in bash I can do it in one line: `python train.py`. I
        | hate examples like this; the 50-LOC claim is totally useless (and
        | so is the code example, as I can't learn anything from it).
        
       | rrr_oh_man wrote:
       | RLHF = Reinforcement Learning from Human Feedback
        
       | potatoman22 wrote:
       | This seems useful, thanks!
        
       | theptip wrote:
        | I'm interested in whether local RLHF is actually viable: can you
        | get meaningful steering from 1k feedback points on a narrow task?
        | That annotation count is achievable with a single dedicated
        | annotator making a few judgments per minute (though tedious); 10k
        | would be roughly a week of work, so achievable for a very
        | dedicated hobbyist; and 100k seems out of reach for a hobby
        | project (rough arithmetic below).
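        | 
        | Back-of-envelope, with assumed rates (illustrative numbers only):
        | 
        |     labels_per_minute = 3            # assumption
        |     hours_per_day = 8                # assumption
        |     labels_per_day = labels_per_minute * 60 * hours_per_day  # 1,440
        |     print(10_000 / labels_per_day)   # ~7 days of labeling
        |     print(100_000 / labels_per_day)  # ~70 days of labeling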
       | 
        | Say, for simple conversational use cases (e.g. customer support
        | for a specific product, interactive fiction, things like that
        | without deep technical knowledge).
       | 
       | I was also wondering if it's possible to do such RLHF for SD
       | running locally.
        
       | ilaksh wrote:
        | How do you normally do DPO? Is that built into PyTorch or
       | something?
       | 
       | Theoretically the hard part is collecting the examples with
       | rejections etc.
        
         | patelajay285 wrote:
          | Collecting data is hard, but the library is also a synthetic
          | data generation library, so, for example, you can create the
          | data for DPO fully synthetically; check out the self-rewarding
          | LLMs example:
         | https://datadreamer.dev/docs/latest/pages/get_started/quick_...
        
       | aethelyon wrote:
       | This is cool, but the data collection is the hard part, right?
        
         | patelajay285 wrote:
          | Yes, it is :), but the library is also a synthetic data
          | generation library, so, for example, you can create the data
          | for DPO fully synthetically; check out the self-rewarding LLMs
          | example:
         | 
         | https://datadreamer.dev/docs/latest/pages/get_started/quick_...
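          | 
          | The rough shape of that kind of synthetic preference-data loop
          | (a sketch of the general self-rewarding recipe, not
          | DataDreamer's API; `generate` and `judge_score` below are
          | hypothetical stand-ins for real LLM calls):
          | 
          |     import random
          | 
          |     def generate(prompt: str) -> str:
          |         # Hypothetical stand-in: sample a completion from the LLM
          |         return f"candidate {random.randint(0, 999)} for: {prompt}"
          | 
          |     def judge_score(prompt: str, completion: str) -> float:
          |         # Hypothetical stand-in: the same LLM scores its own
          |         # output (the "self-rewarding" step), e.g. via a rubric
          |         return random.random()
          | 
          |     def preference_pair(prompt: str, n: int = 4) -> dict:
          |         # Sample candidates, keep best as "chosen", worst as
          |         # "rejected" for DPO-style training
          |         cands = sorted((generate(prompt) for _ in range(n)),
          |                        key=lambda c: judge_score(prompt, c))
          |         return {"prompt": prompt, "chosen": cands[-1],
          |                 "rejected": cands[0]}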
        
           | sillysaurusx wrote:
           | I'm extremely skeptical of this approach. Until proven
           | otherwise, with a model that users actually find useful, I
           | don't think this can work.
           | 
           | It would be nice. But I've seen too many nice ideas
           | completely fall apart in practice to accept this without some
           | justification. Even if there are papers on the topic, and
           | those papers show that the models rank highly according to
           | some eval metrics, the only metric that truly matters is "the
           | user likes the model and it solves their problems."
           | 
              | By the way, on a separate topic, the 90/10 dataset split
              | that you do in all of your examples turns out to be fraught
              | with peril in practice. The issue is that validation
              | dataset _quality_ is crucial, and randomly yeeting 10% of
              | your data into the validation set without manual review is
              | a recipe for problems.
        
             | patelajay285 wrote:
              | It's a demo snippet showing how to set up the workflow;
              | it's not meant to be a working production example of a
              | self-rewarding model or a faithful reproduction of the
              | original paper. Whether or not self-rewarding LLMs are a
              | good idea, they're a very active area of research in the
              | literature today. This is a library for ML researchers, who
              | should actively research and study these avenues along with
              | the pitfalls you're mentioning. But in order for them to do
              | that, building these workflows has to be accessible to
              | them, which is what this library is meant to do. It's not
              | meant for the "hobbyist" ML community; they should not be
              | using synthetic data this way today, as it would likely
              | lead to subpar results for any practical model or task.
        
       ___________________________________________________________________
       (page generated 2024-02-11 23:00 UTC)