[HN Gopher] Beware of Unreliable Data in Model Evaluation: A LLM...
       ___________________________________________________________________
        
       Beware of Unreliable Data in Model Evaluation: A LLM Prompt
       Selection Case Study
        
       Author : cmauck10
       Score  : 35 points
       Date   : 2023-06-29 17:28 UTC (5 hours ago)
        
 (HTM) web link (cleanlab.ai)
 (TXT) w3m dump (cleanlab.ai)
        
       | neeleshs wrote:
       | Given that LLMs fail to give consistent answers to the same
       | questions, how does that factor into these studies?
        
         | cmauck10 wrote:
         | Most LLMs allow you to specify the temperature parameter that
         | governs the randomness and thus the creativity of the
         | responses. For this experiment I used a very low temperature to
         | ensure consistency for a given prompt.
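          | 
          | A minimal sketch of what a near-deterministic setup looks
          | like, assuming the Hugging Face transformers library and
          | the google/flan-t5-base checkpoint (illustrative choices,
          | not necessarily the article's exact setup); greedy
          | decoding removes sampling randomness entirely:
          | 
          |   from transformers import AutoTokenizer
          |   from transformers import AutoModelForSeq2SeqLM
          | 
          |   name = "google/flan-t5-base"
          |   tokenizer = AutoTokenizer.from_pretrained(name)
          |   model = AutoModelForSeq2SeqLM.from_pretrained(name)
          | 
          |   prompt = "Is this review positive or negative? I loved it."
          |   inputs = tokenizer(prompt, return_tensors="pt")
          | 
          |   # Greedy decoding (do_sample=False) removes sampling
          |   # randomness, so the same prompt gives the same answer.
          |   out = model.generate(**inputs, do_sample=False,
          |                        max_new_tokens=10)
          |   print(tokenizer.decode(out[0], skip_special_tokens=True))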
        
       | cmauck10 wrote:
        | It's pretty common now for data scientists and ML engineers to
        | validate the quality of the training data fed into these LLMs,
        | but what about the test data used to evaluate them?
       | 
       | I spent some time playing around with the FLAN-T5 open-source LLM
       | from Google Research and I discovered that noisy test/evaluation
       | data can actually cause you to choose sub-optimal prompts.
       | 
        | Given two prompts A and B, I found multiple cases where prompt A
        | performed better on the observed (noisy) test data, yet worse on
        | the high-quality test data. In practice, this means you would
        | pick A as the "best prompt" when prompt B is actually the better
        | one. I also showed the accuracy difference to be statistically
        | significant via McNemar's test.
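        | 
        | For readers unfamiliar with McNemar's test: it compares two
        | classifiers (here, the same model under prompt A vs. prompt B)
        | on paired predictions over the same test examples. A minimal
        | sketch using statsmodels, with made-up correctness vectors
        | purely for illustration:
        | 
        |   import numpy as np
        |   from statsmodels.stats.contingency_tables import mcnemar
        | 
        |   # 1 = prompt answered the example correctly (toy data)
        |   a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
        |   b = np.array([1, 0, 1, 1, 1, 1, 1, 1, 1, 0])
        | 
        |   # 2x2 table of paired outcomes (rows: A, cols: B)
        |   table = [[np.sum((a == 1) & (b == 1)),
        |             np.sum((a == 1) & (b == 0))],
        |            [np.sum((a == 0) & (b == 1)),
        |             np.sum((a == 0) & (b == 0))]]
        | 
        |   # Exact binomial test on the discordant pairs
        |   result = mcnemar(table, exact=True)
        |   print(result.statistic, result.pvalue)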
       | 
       | This article explains my methodology and how I used data-centric
       | AI to automatically clean the noisy test data in order to ensure
       | optimal prompt selection.
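        | 
        | As a rough idea of what that can look like with the
        | open-source cleanlab package (the tiny arrays below are
        | placeholders, and the article's actual pipeline may differ),
        | likely label issues in the test set can be flagged from the
        | model's predicted class probabilities:
        | 
        |   import numpy as np
        |   from cleanlab.filter import find_label_issues
        | 
        |   # Observed (possibly noisy) test labels and the model's
        |   # predicted class probabilities per example (toy data)
        |   labels = np.array([0, 1, 1, 0, 1, 0])
        |   pred_probs = np.array([[0.9, 0.1],
        |                          [0.2, 0.8],
        |                          [0.7, 0.3],  # label 1, model says 0
        |                          [0.8, 0.2],
        |                          [0.1, 0.9],
        |                          [0.6, 0.4]])
        | 
        |   issues = find_label_issues(
        |       labels, pred_probs,
        |       return_indices_ranked_by="self_confidence")
        |   print("Likely mislabeled test examples:", issues)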
        
       ___________________________________________________________________
       (page generated 2023-06-29 23:01 UTC)