Aligning a LLM with Human Preferences

To better align the responses that instruction-tuned LLMs generate with what humans prefer, we can train LLMs against a reward model or a dataset of human preferences in a process known as RLHF (Reinforcement Learning from Human Feedback). DataDreamer makes this process simple and straightforward. We demonstrate it below using DPO (Direct Preference Optimization), with LoRA so that only a fraction of the weights are trained.

from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFDPO
from peft import LoraConfig

with DataDreamer("./output"):
    # Get the DPO dataset
    dpo_dataset = HFHubDataSource(
        "Get DPO Dataset", "Intel/orca_dpo_pairs", split="train"
    )

    # Keep only 1000 examples as a quick demo
    dpo_dataset = dpo_dataset.take(1000)

    # Create training data splits
    splits = dpo_dataset.splits(train_size=0.90, validation_size=0.10)

    # Align the TinyLlama chat model with human preferences
    trainer = TrainHFDPO(
        "Align TinyLlama-Chat",
        model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        peft_config=LoraConfig(),
        device=["cuda:0", "cuda:1"],
        dtype="bfloat16",
    )
    trainer.train(
        train_prompts=splits["train"].output["question"],
        train_chosen=splits["train"].output["chosen"],
        train_rejected=splits["train"].output["rejected"],
        validation_prompts=splits["validation"].output["question"],
        validation_chosen=splits["validation"].output["chosen"],
        validation_rejected=splits["validation"].output["rejected"],
        epochs=3,
        batch_size=1,
        gradient_accumulation_steps=32,
    )
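Once training finishes, the result is a LoRA adapter on top of the base chat model. As a minimal sketch of how the aligned model might then be used for inference (the adapter path ./output/tinyllama-dpo-adapter below is a hypothetical placeholder; point it at wherever you exported the trained adapter), the standard transformers and peft APIs suffice:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the original base chat model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Attach the DPO-trained LoRA adapter (hypothetical path; adjust to
# the location where the trained adapter was exported)
model = PeftModel.from_pretrained(model, "./output/tinyllama-dpo-adapter")

# Generate a response from the aligned model
messages = [{"role": "user", "content": "How do I make a cup of tea?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=128)

# Print only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))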