[HN Gopher] Beware of Unreliable Data in Model Evaluation: A LLM...
___________________________________________________________________
Beware of Unreliable Data in Model Evaluation: A LLM Prompt
Selection Case Study
Author : cmauck10
Score : 35 points
Date : 2023-06-29 17:28 UTC (5 hours ago)
(HTM) web link (cleanlab.ai)
(TXT) w3m dump (cleanlab.ai)
| neeleshs wrote:
| Given that LLMs fail to give consistent answers to the same
| questions, how does that factor into these studies?
| cmauck10 wrote:
| Most LLMs allow you to specify the temperature parameter that
| governs the randomness and thus the creativity of the
| responses. For this experiment I used a very low temperature to
| ensure consistency for a given prompt.
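       |
       | Roughly, the setup looks like this (a minimal sketch using the
       | Hugging Face transformers API with FLAN-T5; the values here are
       | illustrative, not my exact code):
       |
       |   from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
       |
       |   tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
       |   model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
       |
       |   # An illustrative prompt; the article's prompts differ.
       |   prompt = "Classify this review as positive or negative: great product!"
       |   inputs = tokenizer(prompt, return_tensors="pt")
       |
       |   # A very low temperature (or greedy decoding via do_sample=False)
       |   # keeps the output nearly deterministic for a given prompt.
       |   outputs = model.generate(
       |       **inputs,
       |       do_sample=True,
       |       temperature=0.1,
       |       max_new_tokens=32,
       |   )
       |   print(tokenizer.decode(outputs[0], skip_special_tokens=True))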
| cmauck10 wrote:
 | It's pretty common now for data scientists and ML engineers to
 | validate the quality of the training data fed into these LLMs,
 | but what about the test data used to evaluate them?
|
| I spent some time playing around with the FLAN-T5 open-source LLM
| from Google Research and I discovered that noisy test/evaluation
| data can actually cause you to choose sub-optimal prompts.
|
 | Given two prompts A and B, I found multiple cases where prompt A
 | performed better on the observed (noisy) test data, yet worse on
 | the high-quality test data. In practice, this means you would
 | pick A as the "best prompt" when prompt B is actually the better
 | one. I also showed the accuracy difference to be statistically
 | significant via McNemar's test.
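 |
 | For anyone curious, this kind of McNemar comparison on the two
 | prompts' per-example correctness only takes a few lines. A
 | generic sketch with statsmodels (toy data, not my actual
 | results):
 |
 |   import numpy as np
 |   from statsmodels.stats.contingency_tables import mcnemar
 |
 |   # Whether prompt A / prompt B answered each test example correctly
 |   correct_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=bool)
 |   correct_b = np.array([1, 0, 0, 1, 1, 1, 1, 1, 1, 1], dtype=bool)
 |
 |   # 2x2 contingency table of agreements/disagreements between prompts
 |   table = [
 |       [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
 |       [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
 |   ]
 |
 |   # Exact binomial test on the discordant pairs
 |   result = mcnemar(table, exact=True)
 |   print(result.statistic, result.pvalue)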
|
| This article explains my methodology and how I used data-centric
| AI to automatically clean the noisy test data in order to ensure
| optimal prompt selection.
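 |
 | The cleaning step is automated with cleanlab's open-source
 | package. As a rough sketch of the kind of call involved (toy
 | arrays here, not the article's actual data):
 |
 |   import numpy as np
 |   from cleanlab.filter import find_label_issues
 |
 |   # Observed (possibly noisy) test labels and the model's predicted
 |   # class probabilities for each test example (2 classes here).
 |   labels = np.array([0, 1, 0, 1, 1, 0])
 |   pred_probs = np.array([
 |       [0.9, 0.1],
 |       [0.2, 0.8],
 |       [0.3, 0.7],  # labeled 0, but the model favors class 1 -> likely issue
 |       [0.1, 0.9],
 |       [0.4, 0.6],
 |       [0.8, 0.2],
 |   ])
 |
 |   # Indices of test examples whose labels look wrong, ranked by severity
 |   issue_idx = find_label_issues(
 |       labels=labels,
 |       pred_probs=pred_probs,
 |       return_indices_ranked_by="self_confidence",
 |   )
 |   print(issue_idx)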
___________________________________________________________________
(page generated 2023-06-29 23:01 UTC)