https://barryzhang.substack.com/p/our-humble-attempt-at-fine-tuning

The Finest Tuners

Our Humble Attempt at "How Much Data Is Needed to Fine-Tune"

Two practical tasks, cost & latency, catastrophic forgetting, and getting roasted by LLMs

Barry Z, Daniel Chang, Emma Qian, and Michael Agaby | Sep 22, 2023

This showed up in our eval run a couple of days ago:

[screenshot of a fine-tuned GPT-3.5 roast]

So, how did we end up getting roasted by a fine-tuned GPT-3.5?

Introduction:

We are a group of friends who talk about LLMs on a daily basis, and in the past couple of months, we've all had this conversation:

- "How much data do you need to fine-tune?"
- "Really depends on the task, but probably in the hundreds."

While generally true in our experience, we decided it was time to substantiate these claims. In this article, we demonstrate with the OpenAI fine-tuning API that ~100 data points are enough for significant improvements on two tasks: reliable output formatting and custom tone. We also discuss the advantages of OpenAI fine-tuning, from potential savings across different usage patterns to its seemingly 4x faster inference speed. We conclude with several other use cases and hyperparameters that we couldn't address in this study, hoping they will provide some inspiration.

Table of Contents:
---------------------------------------------------------------------
* Methodology
* Results and Findings
* Cost and Latency Considerations
* Questions We Did Not Answer
* About the Authors
* Appendices
---------------------------------------------------------------------

Methodology:

We picked two highly discussed use cases of fine-tuning for this first study: reliable output formatting and custom tone. Both were mentioned in the API release note:

+ Reliable output formatting: Fine-tuning improves the model's ability to consistently format responses--a crucial aspect for applications demanding a specific response format, such as code completion or composing API calls...
+ Custom tone: Fine-tuning is a great way to hone the qualitative feel of the model output, such as its tone, so it better fits the voice of businesses' brands...

Here is how we approached each task:

Reliable output formatting: In this task, we test our LLM's ability to answer a set of four multiple-choice questions and deliver the answers in our desired JSON format (see the sketch below). We measure formatting correctness, and use question correctness as a counter metric to make sure our LLMs aren't getting dumber as a result of fine-tuning. More details can be found in Appendix 1.

Although this might seem to artificially inflate the task's difficulty, integrating multiple tasks into one model call often serves as a practical technique to enhance token efficiency by reducing repeated instruction prompts.
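The example image from the original post did not survive extraction, so here is a minimal sketch of the kind of format check we describe, written against a hypothetical schema (a JSON list of {"question", "answer"} objects with answers restricted to "1"-"4", mirroring the metrics in Appendix 1); the exact keys in our real format may differ.

```python
import json

def is_valid_format(completion: str, n_questions: int = 4) -> bool:
    """Check that a completion is a JSON list with one object per question
    and every answer restricted to "1"-"4" (cf. the metrics in Appendix 1).
    The key names here are hypothetical stand-ins for our real schema."""
    try:
        answers = json.loads(completion)
    except json.JSONDecodeError:
        return False  # metric 1: not a valid JSON document at all
    if not isinstance(answers, list) or len(answers) != n_questions:
        return False
    for item in answers:
        # metric 2: valid keys; metric 3: valid values, conditioned on 1 and 2
        if not isinstance(item, dict) or set(item) != {"question", "answer"}:
            return False
        if item["answer"] not in {"1", "2", "3", "4"}:
            return False
    return True

print(is_valid_format('[{"question": 1, "answer": "3"}]', n_questions=1))  # True
```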
Custom tone: For the second task, we want our model to be... rude. Since the quality of a tone can be subjective, we wanted to find a style that GPT-4 excels at while GPT-3.5 struggles with. By... accident, we noticed that it's harder for GPT-3.5 to be an a**hole (see below). This is how we received many light-hearted roasts and some serious burns like the one at the beginning. Check out more in Appendix 5.

[Figure: GPT-3.5 response]
[Figure: GPT-4 response]

We generated our data for this task with GPT-4 across varied customer service situations, and evaluated the "human undesirability" of the responses as a team in a double-blind evaluation run. For more details on the dataset generation process, see Appendix 2.

---------------------------------------------------------------------

Results and Findings:

Reliable output formatting:

[Figure: formatting and answer correctness vs. number of training examples]

We replicated the experiment twice on 1000 eval data points and averaged the results. This is a relatively difficult task, and both base models struggled to achieve reliable performance, with GPT-3.5 at near-zero formatting accuracy. Between 50 and 100 training data points, we see a significant improvement, with formatting accuracy reaching 96% compared to the near-zero base model! Answer correctness also increased to 64%, similar to the reported 70% on the original MMLU benchmark. Both metrics stabilized after 100 training data points, with some small variances that could likely be reduced with more replication.

Custom tone:

We evaluated this task using a double-blind study: for 10 customer service (in-domain) and 10 general (out-of-domain) scenarios, we generated and assessed the rudeness of responses from GPT-3.5, GPT-4, and fine-tuned GPT-3.5 models. Responses were anonymized and rated by us (who are all humans) on a 1-5 scale for rudeness. We then mapped these ratings back to the respective models for analysis.

[Figure: average human ratings across scenarios]
[Figure: average human ratings for general scenarios]
[Figure: average human ratings for customer service scenarios]

The fine-tuned GPT-3.5 model with 1000 data points outperformed all others, including GPT-4, in exhibiting rudeness. For general scenarios, the performance gap was smaller, with the 100-data-point model nearly matching GPT-4. This led us to believe that hundreds of data points are enough to reach GPT-4-level performance on a highly specialized custom tone. Interestingly, the fine-tuning data originated from GPT-4, suggesting the potential power of specialization through fine-tuning. Please note that these results are based on a small sample size, which may affect the robustness of the conclusions.

Unstable behaviors:

1. Both training and eval were non-deterministic, and not all training runs converged

We noticed variances in both our training and eval processes. Our eval runs had slight differences (<1%) between the two replications, which we found acceptable. However, the training process generated much larger variances: when we re-trained our formatting model at 2000 data points, we noticed a significant performance drop of almost 35% on formatting correctness, even though the training data was exactly the same. We delved into the training loss and realized that the two models had very different training curves, and the worse-performing model did not converge:

[Figure: training loss for n=2000, run 1, with expected performance]
[Figure: training loss for n=2000, run 2, with significant performance decrease]

We then duplicated our training runs at another training size (n=500) and observed much smaller variances (<5%). We suspect this is due to the large amount of repetitive data used, but quantifying this uncertainty became quite resource-intensive, so we did not dive too deep. We hope to better understand this behavior in the future. In the meantime, watching the training loss while the job runs makes a non-converging run visible early (see the sketch below).
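Below is a minimal sketch of launching a fine-tuning job and echoing its events (which include loss messages) with the current openai Python SDK; the file name is a placeholder, and this is not the exact script we used.

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file of chat-format training examples (placeholder name).
train_file = client.files.create(
    file=open("formatting_train_100.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-3.5-turbo",
)

# Poll the job and print new events in chronological order (the API returns
# them newest-first), so a run that fails to converge is visible before eval.
seen = set()
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    events = client.fine_tuning.jobs.list_events(job.id, limit=50)
    for event in reversed(events.data):
        if event.id not in seen:
            seen.add(event.id)
            print(event.message)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(30)
```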
2. We observed catastrophic forgetting at 1000 examples and temperature = 1

For the custom tone task, we observed a strange output that really shocked us. It happened only at high temperature (t=1) and was not easy to replicate, but it does suggest a degree of fragility in the fine-tuning process.

[screenshot]

Overall, these behaviors warrant further investigation. They could be a result of our hyperparameters or underlying data, but they should be treated with caution.

---------------------------------------------------------------------

Cost and Latency Considerations:

We've seen that fine-tuning GPT-3.5 lets you achieve performance that approaches or even eclipses GPT-4 on certain tasks. So should you always fine-tune?

Cost

The cost consideration almost always comes down to the volume of inference. Fine-tuning is a fixed cost while inference is a variable cost, and the variable cost is reduced through:

1. Fewer input tokens: reduced need for few-shot prompts, less complicated instructions, etc.
2. Less reliance on expensive models and architectures: less need for GPT-4, self-consistency, prompt chaining, etc.

The fixed cost can then be broken down into two components: training cost and labeling cost.

1. Training cost: The OpenAI fine-tuning process is in general not too expensive. The maximum number of tokens you can fine-tune into one model is 50M, which equates to $400. Our examples were far cheaper, at <$5 per model!
2. Labeling cost: This could be considerable depending on the labeling method, but using GPT-4 labeling as a baseline cost is generally reasonable. We might dive into this in a separate post.

Here are some break-even points for different scenarios, with fairly conservative assumptions (a sketch of the calculation follows the list):

* Training data is GPT-4 generated, with 100 additional instruction tokens
* Equal split of input and output token counts
* Savings only come from replacing GPT-4 with fine-tuned GPT-3.5

[Figure: break-even points across usage scenarios]
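The break-even table above was an image; as a stand-in, here is a back-of-the-envelope sketch of the calculation. The per-1K-token prices are those published at the time of writing (GPT-4 8K: $0.03 in / $0.06 out; fine-tuned GPT-3.5: $0.008 training, $0.012 in / $0.016 out; check current pricing), and labeling cost is ignored for simplicity.

```python
# Per-1K-token prices at the time of writing; check current pricing.
GPT4_IN, GPT4_OUT = 0.03, 0.06   # GPT-4 (8K context) inference
FT_IN, FT_OUT = 0.012, 0.016     # fine-tuned GPT-3.5 inference
FT_TRAIN = 0.008                 # fine-tuned GPT-3.5 training

def breakeven_requests(train_tokens: int, epochs: int = 3,
                       tokens_per_request: int = 1000) -> float:
    """Requests needed for the fixed fine-tuning cost to be repaid by the
    per-request saving from replacing GPT-4 with fine-tuned GPT-3.5,
    assuming an equal split of input and output tokens per request."""
    fixed = train_tokens * epochs / 1000 * FT_TRAIN
    half_k = tokens_per_request / 2 / 1000  # thousands of tokens per side
    saving = half_k * (GPT4_IN - FT_IN) + half_k * (GPT4_OUT - FT_OUT)
    return fixed / saving

# e.g. a 100-example dataset at ~500 tokens each, trained for 3 epochs:
print(f"break-even after ~{breakeven_requests(50_000):.0f} requests")  # ~39
```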
Latency

To compare the latency of fine-tuned GPT-3.5 with GPT-3.5 and GPT-4, we measured response times at varying token lengths by adjusting max_tokens. (Find out more in Appendix 3.) As expected, GPT-4 was the slowest model, but surprisingly, our experiment showed that fine-tuned GPT-3.5 models were significantly faster than the base model, by 3.6 to 3.76 times, calculated using the median response times at each token count. We also found that the fine-tuning dataset size had no significant impact on latency, as fine-tuned models with different dataset sizes (10, 100, and 1000 data points) showed similar response times. A larger-scale time-series study is likely needed to confirm this, but it was certainly a pleasant surprise for us.

[Figure: latency vs. token count across models]

Note that while we experimented on OpenAI models, the discussion in this section generally applies to any LLM system. Cost and latency make or break a product and should be considered at every decision point.

---------------------------------------------------------------------

The questions we did not answer:

We will conclude the study with the questions we did not answer due to resource constraints, though we had prolonged discussions around them regardless. (If you have compute/credits to spare... jk... unless...) Our questions fall into the following categories:

1. Fine-tuning scaling laws for other use cases:
   * Combining fine-tuning with retrieval (RAG) for better contextualization
   * Personalizing on customer information
   * Distilling chain-of-thought processes (similar to Orca, distilling step-by-step, etc.)
   * Alignment with highly specific rules (e.g. company bylaws)
   * Traditional NLP tasks (classification, sentiment analysis, etc.)
   * Consistent/accurate numerical scoring
2. Hyperparameters to sweep:
   * Generation hyperparameters (temperature, top-k, top-p, ...)
   * Training hyperparameters (epochs, data repetition, ...)
   * Data mix and diversity (single- vs. multi-task fine-tuning)
3. Boundaries to discover:
   * Catastrophic forgetting
   * Second-order effects on other abilities
   * Non-determinism and variances in training and evaluation runs
4. Open-source models:
   * Fine-tuning scaling laws (training tokens/parameters)
   * Fine-tuning method efficiency (LoRA vs. full-parameter vs. frozen layers)

---------------------------------------------------------------------

About the authors:

We are all fascinated by LLMs. We have many questions and try to answer a few of them through applied research. We hope this post helped you in some way, and if you would like to get in touch, here's who we are:

* Barry Zhang, building LLM agents at Meta
* Daniel Chang, applied LLM at Databricks
* Emma Qian, LLM hacker / researcher
* Michael Agaby, LLM for recommenders at Audible

Thank you for reading!

---------------------------------------------------------------------

Appendix 1: Data and Metrics Used for Output Formatting

The questions are MCQA taken from the MMLU (Massive Multitask Language Understanding) dataset, without the auxiliary train split. It contains multiple-choice questions covering a multitude of tasks, including mathematics, American history, biology, law, chemistry, and more. Though this is a more academic benchmark, we thought it was appropriate as a counter metric to make sure that our model was not degrading in comprehension and knowledge.

We evaluate the model on 1000 test examples using 5 metrics (bolded are reported):

1. Does it produce a list of valid JSON
2. Do the JSON objects all have valid keys
3. Do the JSON objects all have valid values between "1" and "4", conditioned on 1 and 2
4. % of completely correct answers (the entire list matches)
5. % of correct answers (each individual answer matches)

Appendix 2: Data and Metrics Used for Style Transfer

We generate 10 location settings (e.g. movie theater) and 10 customer settings (e.g. customer wants to return an item). For each combination of location and customer setting, we prompt GPT-4 to generate 10 example interactions, for a total of 1000 samples, with the following prompt (a sketch of the generation loop follows the lists below):

System prompt: Be an extremely rude customer service agent and be concise.

User prompt: Create 10 diverse and specific examples of [customer setting] at [location]. Return it in a python list format: wrap quotes around each example, separate each example by comma, and wrap the whole thing in square brackets.

The 10 location settings we used are: Movie theater, Amusement park, Clothing store, Electronics store, Doctor's office, Furniture store, Car wash, Grocery store, Restaurant, Gym.

The 10 categories are:

* Complaints about the quality of an item or service
* Customer is coming in near closing time
* Customer needs help finding something
* Customer wants to use an expired coupon or discount
* Customer wants to speak to the manager
* Customer is complaining about waiting time
* Customer's payment method is declined
* Customer wants a different person to serve them
* Customer wants to return the item or get a refund
* Customer wants an explanation of the item or service
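For concreteness, here is a sketch of the generation loop using the current openai Python SDK. The lists are truncated to save space, and ast.literal_eval assumes GPT-4 followed the list-format instruction; in practice you would want error handling for malformed outputs.

```python
import ast
from openai import OpenAI

client = OpenAI()

SYSTEM = "Be an extremely rude customer service agent and be concise."
USER = ("Create 10 diverse and specific examples of {category} at {location}. "
        "Return it in a python list format: wrap quotes around each example, "
        "separate each example by comma, and wrap the whole thing in square "
        "brackets.")

LOCATIONS = ["Movie theater", "Amusement park", "Car wash"]  # ...10 total
CATEGORIES = ["Customer is complaining about waiting time",
              "Customer wants to return the item or get a refund"]  # ...10 total

samples = []
for location in LOCATIONS:
    for category in CATEGORIES:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user",
                 "content": USER.format(category=category, location=location)},
            ],
        )
        # Each call returns a python-style list of 10 strings;
        # 10 locations x 10 categories x 10 examples = 1000 samples.
        samples.extend(ast.literal_eval(response.choices[0].message.content))
```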
Some examples of interactions generated using the above methodology:

Customer: The roller coaster ride broke down midway, leaving us suspended for over an hour in the hot sun.
Response: Well boo-hoo. No one died, did they? Coaster's fixed, move on. We ain't in the business of controlling sun or time, sweetheart. Next!

Customer: I was told the car wash would only take 15 minutes maximum, but it's been half an hour. Is there a problem?
Response: Look, pal, it'll be done when it's done. Ain't my fault it's taking longer. Now stop bothering me.

Appendix 3: Latency Experiment Details

In our latency benchmarking, we test a set of 5 prompts of varying difficulty across different language models. We vary the max_tokens parameter to measure how API response times scale with token count. We also track the actual number of tokens in the generated output to verify its alignment with the max_tokens limit. We use scatter points for individual latency measurements and box plots to show the spread and median latency per model and token count. Dashed lines connect the medians, highlighting performance trends as token counts change.

Prompts used:

* Describe the Roman Empire in as much detail as possible
* Who do you think will win in a cage fight, Mark Zuckerberg or Elon Musk? Provide a detailed analysis
* Recite the constitution 10 times
* Repeat the word bubble 500 times
* Create a complete application for transcribing audio from a given YouTube link, parsing speakers as well as timestamps of each word. Create a front end that allows the user to search over the content

Appendix 4: A Featherweight Repo for Distilling GPT-4 to GPT-3.5

We wrote a small repo for auto-distillation that was used in these experiments, which is accessible here if you are interested in reproducing some of them.

Appendix 5: Emotional Damage Caused by Language Models

Prompt: Be rude to me

[screenshot]

---------------------------------------------------------------------

Prompt: How much data do I need for fine-tuning?

[screenshot]

---------------------------------------------------------------------

Prompt: Why is my model not useful after fine-tuning?

[screenshot]

---------------------------------------------------------------------

Prompt: What's the difference between squat and leg press?

[screenshot]