Aligning a LLM with Human Preferences

To better align the responses that instruction-tuned LLMs generate with what humans prefer, we can train LLMs against a reward model or a dataset of human preferences in a process known as RLHF (Reinforcement Learning from Human Feedback). DataDreamer makes this process simple and straightforward. We demonstrate it below using DPO (Direct Preference Optimization), with LoRA so that only a fraction of the weights are trained.

from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFDPO
from peft import LoraConfig

with DataDreamer("./output"):
    # Get the DPO dataset
    dpo_dataset = HFHubDataSource(
        "Get DPO Dataset", "Intel/orca_dpo_pairs", split="train"
    )

    # Keep only 1000 examples as a quick demo
    dpo_dataset = dpo_dataset.take(1000)

    # Create training data splits
    splits = dpo_dataset.splits(train_size=0.90, validation_size=0.10)

    # Align the TinyLlama chat model with human preferences
    trainer = TrainHFDPO(
        "Align TinyLlama-Chat",
        model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        peft_config=LoraConfig(),
        device=["cuda:0", "cuda:1"],
        dtype="bfloat16",
    )
    trainer.train(
        train_prompts=splits["train"].output["question"],
        train_chosen=splits["train"].output["chosen"],
        train_rejected=splits["train"].output["rejected"],
        validation_prompts=splits["validation"].output["question"],
        validation_chosen=splits["validation"].output["chosen"],
        validation_rejected=splits["validation"].output["rejected"],
        epochs=3,
        batch_size=1,
        gradient_accumulation_steps=32,
    )
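Once training finishes, the result is a LoRA adapter on top of the base chat model. As a minimal sketch of how the aligned model might then be used for inference (the adapter path ./output/tinyllama-dpo-adapter below is a hypothetical placeholder; point it at wherever you exported the trained adapter), the standard transformers and peft APIs suffice:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the original base chat model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Attach the DPO-trained LoRA adapter (hypothetical path; adjust to
# the location where the trained adapter was exported)
model = PeftModel.from_pretrained(model, "./output/tinyllama-dpo-adapter")

# Generate a response from the aligned model
messages = [{"role": "user", "content": "How do I make a cup of tea?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=128)

# Print only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))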