# Task-Specific LLM Evals that Do & Don't Work

If you've run off-the-shelf evals on your tasks, you may have found that most don't work. They barely correlate with application-specific performance and aren't discriminative enough to use in production. As a result, we could spend weeks and still not have evals that reliably measure how we're doing on our tasks.

To save us some time, I'm sharing some evals I've found useful. The goal is to spend less time figuring out evals so we can spend more time shipping to users. We'll focus on simple, common tasks like classification/extraction, summarization, and translation. (Although classification evals are basic, having a good understanding helps with the meta problem of evaluating evals.) We'll also discuss how to measure copyright regurgitation and toxicity.

* Classification: Recall, precision, ROC-AUC, PR-AUC, separation of distributions
* Summarization: Consistency via NLI, relevance via reward model, length checks
* Translation: Quality measures via chrF, BLEURT, COMET, COMETKiwi
* Copyright: Exact regurgitation, near-exact reproduction
* Toxicity: Proportion of toxic generations on regular and toxic prompts

At the end, we'll discuss the role of human evaluation and how to calibrate the evaluation bar to balance potential benefits against risks and mitigate the Innovator's Dilemma.

Note: I've tried to make this accessible to folks who don't have a data science or machine learning background. Thus, it starts with the basics of classification eval metrics. Feel free to skip any sections you're already familiar with.

## Classification/Extraction: ROC, PR, class distributions

Classification is the task of assigning predefined labels to text, such as sentiment (positive, negative) or topics (sports, politics). Extraction is similar: we identify specific pieces of information within the text, such as names, dates, or locations. Here's an example:

```python
# Text input
"Alice loves her iPhone 13 mini that she bought on September 16, 2022."

# Classification and extraction output
{
    "sentiment": "positive",        # Sentiment classification
    "topic": "electronics",         # Topic classification
    "toxicity_prob": "0.1",         # Toxicity classification
    "names": [                      # Name extraction
        "Alice",
        "iPhone 13 mini"
    ],
    "dates": [                      # Date extraction
        "September 16, 2022"
    ]
}
```

While these tasks are relatively simple and LLMs likely perform well on them, we'll still want solid evaluations. For example, Voiceflow's eval harness for intent classification helped them catch a 10% performance drop when upgrading from the soon-to-be-deprecated gpt-3.5-turbo-0301 to the more recent gpt-3.5-turbo-1106.

We can apply LLMs to classification by providing a document and prompting the LLM to predict the sentiment or topic, or to check for abusive content or spam. The expected output can be a categorical label ("positive") or the probability of the label ("0.1"). Similarly, LLMs can extract information from a document when prompted to return JSON with keys for desired attributes such as "names" and "dates".
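As a concrete (if simplified) example, here's a minimal sketch of such a prompt using the OpenAI Python client in JSON mode. The model name, prompt, and `classify_and_extract` helper are illustrative assumptions, not a prescription--any model that reliably returns structured output will do.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """Classify the text and extract entities.
Return JSON with keys: sentiment, topic, toxicity_prob, names, dates.

Text: {text}"""


def classify_and_extract(text: str) -> dict:
    """Hypothetical helper: one LLM call for classification + extraction."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any JSON-mode-capable model works
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)


print(classify_and_extract(
    "Alice loves her iPhone 13 mini that she bought on September 16, 2022."
))
```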
For categorical outputs, we can compute aggregate statistics such as recall, precision, and false positives/negatives. This also applies to extraction: What proportion of ground-truth attributes were extracted (recall)? What proportion of extracted attributes were correct (precision)? The Wikipedia page is a good reference. In a nutshell:

* Recall: Proportion of true positives that were correctly identified. If there were 100 positive instances in our data and the model identified 80, recall = 0.8.
* Precision: Proportion of the model's positive predictions that were correct. If the model predicted positive 50 times but only 30 were truly positive, precision = 0.6.
* False positive: Model predicted positive but actually negative.
* False negative: Model predicted negative but actually positive.

IMHO, accuracy is too coarse a metric to be useful. We'd need to separate it into recall and precision at minimum, ideally across thresholds.

It gets interesting when our models can output probabilities instead of just categorical labels (e.g., language classifiers, reward models). Now we can evaluate performance across different probability thresholds, using metrics such as ROC-AUC and PR-AUC.

The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate at various thresholds, visualizing the performance of a classification model across all classification thresholds. The ROC Area Under the Curve (ROC-AUC) is an aggregate measure of performance that ranges from 0.0 to 1.0. A model that's no better than a coin flip would have ROC-AUC = 0.5 while a model that's always correct has ROC-AUC = 1.0. (Cramer would have ROC-AUC < 0.5.)

[Figure: ROC curve with ROC-AUC = 0.85]

ROC-AUC has some advantages. First, it's robust to class imbalance because it specifically measures the true and false positive rates. In addition, it doesn't require picking a threshold since it evaluates performance across all thresholds. Finally, it's scale-invariant, so it doesn't matter if the model's predictions are skewed.

The Precision-Recall curve plots the trade-off between precision and recall across all thresholds. As we raise the threshold for positive predictions, precision and recall move in opposite directions: a higher threshold leads to higher precision (fewer false positives) but lower recall (more false negatives), and vice versa. The area under this curve, PR-AUC, summarizes performance across all thresholds. A perfect classifier has PR-AUC = 1.0 while a random classifier has PR-AUC equal to the proportion of positive labels.

[Figure: PR curves with PR-AUC = 0.87]

The standard PR curve (left) plots precision and recall on the same line, starting from the top-right corner (high precision, low recall) and moving towards the bottom-left corner (low precision, high recall). I prefer a variant (right) where precision and recall are plotted as separate lines against the threshold--this makes it easier to understand the trade-off between precision and recall since they're both on the y-axis.

Another useful diagnostic is plotting the distribution of predicted probabilities for each class. This visualizes how well the model separates the classes. Ideally, we'd see two distinct peaks: one at 0.0 for the negative class and one at 1.0 for the positive class. This suggests the model is confident in its predictions and can cleanly separate the classes. On the other hand, if there's significant overlap between the distributions, it'll likely be difficult to pick a threshold to use in production.

[Figure: Good separation of distributions (JS divergence = 11.078)]

To quantify the separation of distributions, we can compute the Jensen-Shannon divergence (JSD), a symmetric form of Kullback-Leibler (KL) divergence. Concretely, we average the KL divergence from (i) distribution $P$ to $M$, the average of $P$ and $Q$, and (ii) from distribution $Q$ to $M$.

\[\operatorname{JSD}(P \parallel Q) = \frac{1}{2} \left( \operatorname{KL}(P \parallel M) + \operatorname{KL}(Q \parallel M) \right), \quad M = \frac{1}{2}(P + Q)\]

Nonetheless, I've found JSD hard to interpret and prefer to look at the graph directly.
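Here's a minimal sketch of how these diagnostics could be computed with scikit-learn, SciPy, and matplotlib. The data is synthetic, `average_precision_score` is used as a common stand-in for PR-AUC, and the JS divergence is computed on normalized histograms, so it won't necessarily match the values reported in the figures above.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score

# Toy data: swap in your ground-truth labels and the model's predicted probabilities
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1_000)
y_prob = np.clip(0.3 + 0.4 * y_true + rng.normal(0, 0.15, size=1_000), 0, 1)

# Threshold-free aggregate metrics
print(f"ROC-AUC: {roc_auc_score(y_true, y_prob):.3f}")
print(f"PR-AUC (average precision): {average_precision_score(y_true, y_prob):.3f}")

# Precision and recall as separate lines against the threshold (the PR-curve variant above)
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
plt.plot(thresholds, precision[:-1], label="precision")
plt.plot(thresholds, recall[:-1], label="recall")
plt.xlabel("threshold")
plt.legend()

# Separation of distributions: histograms of predicted probabilities per class
bins = np.linspace(0, 1, 51)
neg_hist, _ = np.histogram(y_prob[y_true == 0], bins=bins, density=True)
pos_hist, _ = np.histogram(y_prob[y_true == 1], bins=bins, density=True)
plt.figure()
plt.stairs(neg_hist, bins, label="negative class")
plt.stairs(pos_hist, bins, label="positive class")
plt.xlabel("predicted probability")
plt.legend()
plt.show()

# JSD between the two class-conditional distributions
# (SciPy returns the JS *distance*, i.e., the square root of the divergence)
jsd = jensenshannon(neg_hist, pos_hist) ** 2
print(f"JS divergence: {jsd:.3f}")
```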
Examining the separation of distributions is valuable because a model can have high ROC-AUC and PR-AUC yet still not be suitable for production. For example, if a chunk of the predicted probabilities falls between 0.4 and 0.6 (below), it'll be hard to choose a threshold--getting it wrong by merely 0.05 could lead to a big drop in precision or recall.

[Figure: Poor separation of distributions (JS divergence = 1.101)]

The plot above also shows why n-gram and vector-similarity evals/guardrails don't work: the similarity distributions of positive and negative instances are too close, and thus not discriminative enough to cut a threshold on.

Together, these metrics provide a solid toolbox for diagnosing classification performance and picking good thresholds for production.

[Figure: Diagnostic plots for classification tasks]

Now that we've covered the basics of evaluating classification tasks, we can discuss evals for summarization which, unsurprisingly, can be simplified to classification tasks too.

## Summarization: Consistency, relevance, length

Abstractive summarization is the task of generating concise summaries that capture the key ideas in a source document. Unlike extractive summarization, which lifts entire sentences from the original text, abstractive summarization rephrases and condenses information to create a newer, shorter version. It requires understanding the content, identifying important points, and not introducing hallucination defects.

To evaluate abstractive summaries, Kryscinski et al. (2019) proposed four key dimensions:

* Fluency: Are sentences in the summary well-formed and easy to read? We want to avoid grammatical errors, random capitalization, etc.
* Coherence: Does the summary as a whole make sense? It should be well-structured and logically organized, not just a jumble of information.
* Consistency: Does the summary accurately reflect the content of the source document? We want to ensure there's no new or contradictory information.
* Relevance: Does the summary focus on the most important aspects of the source document? It should include key points and exclude less relevant details.

Most modern language models can generate grammatically correct and readable sentences, making fluency less of a concern; a recent benchmark excluded fluency as an eval for this reason. Coherence is also becoming less of an issue, especially for short summaries of a few sentences or less. This leaves us with factual consistency and relevance, which we can frame as binary classification and reuse the metrics from above. (I seldom see grammatical errors or incoherent text from a decent LLM--maybe 1 in 10k--so there's no need to invest in evaluating fluency and coherence.)

While n-gram (ROUGE, METEOR), similarity (BERTScore, MoverScore), and LLM-based (G-Eval) evals are popular, I've found them unreliable and/or impractical, so we won't discuss them here. See a more detailed critique in the appendix.
To measure factual consistency, we can finetune a natural language inference (NLI) model as a learned metric. A recap of the NLI task: given a premise sentence and a hypothesis sentence, the task is to predict whether the hypothesis is entailed by (logically follows from), is neutral to, or contradicts the premise.

[Figure: Premise and hypothesis for the Natural Language Inference task]

We can use NLI models to evaluate the factual consistency of summaries too. The key insight is to treat the source document as the premise and the generated summary as the hypothesis. If the summary contradicts the source, the summary is factually inconsistent, aka a hallucination.

[Figure: Document and summary for the Natural Language Inference task]

By default, NLI models return probabilities for entailment, neutral, and contradiction. To get the probability of factual inconsistency, we drop the neutral dimension, apply a softmax to the remaining entailment and contradiction dimensions, and take the probability of contradiction. Be sure to check what each of your NLI model's dimensions represents--Google's T5 NLI model has entailment at dim = 1 while Meta's BART NLI model has it at dim = 2!

```python
import torch
import torch.nn.functional as F


def get_prob_of_contradiction(logits: torch.Tensor) -> torch.Tensor:
    """
    Returns probability of contradiction aka factual inconsistency.

    Args:
        logits (torch.Tensor): Tensor of shape (batch_size, 3). The second dimension
            holds the logits for contradiction, neutral, and entailment.

    Returns:
        torch.Tensor: Tensor of shape (batch_size,) with the probability of contradiction.

    Note:
        This function assumes the contradiction logit is at index 0.
    """
    # Drop the neutral logit (index=1), softmax, and take prob of contradiction (index=0)
    prob = F.softmax(logits[:, [0, 2]], dim=1)[:, 0]
    return prob
```

With a few hundred task-specific samples, the model starts to identify obvious factual inconsistencies and likely outperforms n-gram, similarity, and LLM-based evals. With a thousand samples or more, it becomes a solid factual consistency eval and may be good enough as a hallucination guardrail. To reduce the need for data annotation, we can bootstrap with open-source, permissive-use data such as the Factual Inconsistency Benchmark (FIB) and the Unified Summarization Benchmark (USB).

The graphs below plot the performance of NLI evals for factual inconsistency on FIB, before finetuning (top) and after finetuning on USB and FIB (bottom). While there's certainly room for improvement, it shows how a little finetuning on open-source, permissive-use data can improve ROC-AUC from 0.56 (practically random) to 0.85!

[Figure: Factual inconsistency eval before (top; ROC-AUC = 0.56) and after (bottom; ROC-AUC = 0.85) finetuning]

In terms of ROI, I think it's hard to beat the NLI approach for evaluating and/or detecting factual inconsistency. If you know of anything better, please DM me!
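Before any finetuning, we can sanity-check the approach with an off-the-shelf NLI model. Here's a minimal sketch that assumes Hugging Face's transformers and the facebook/bart-large-mnli checkpoint (label order: contradiction, neutral, entailment) and reuses the `get_prob_of_contradiction` function above; the document and summary are made up for illustration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "facebook/bart-large-mnli"  # labels: 0=contradiction, 1=neutral, 2=entailment
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

document = "Alice bought an iPhone 13 mini on September 16, 2022 and loves it."
summary = "Alice is unhappy with the iPhone 13 mini she bought."  # hallucinated sentiment

# Premise = source document, hypothesis = generated summary
inputs = tokenizer(document, summary, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 3)

prob_inconsistent = get_prob_of_contradiction(logits)
print(f"P(factual inconsistency) = {prob_inconsistent.item():.2f}")
```

Swap in a finetuned checkpoint and the same few lines of scoring code become your factual consistency eval or hallucination guardrail.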
The same paradigm can also be applied to develop a learned metric of relevance. In a nutshell, we'd collect human judgments on the relevance of generated summaries and then finetune an NLI model to predict these relevance ratings.

An alternative is to train a reward model on human preferences. Stiennon et al. (2020), the predecessor of InstructGPT, trained a reward model to evaluate abstractive summaries of Reddit posts. Wu et al. (2021) did similar work on fiction novels.

In Stiennon et al. (2020), they updated their summarization language model to return a numeric score instead of a text summary, making it a reward model that scores the quality of summaries. This is done by adding a linear head that outputs a scalar value. The reward model is then trained on pairs of summary preferences to give higher scores to better summaries. For each pair of summaries $y_0$ and $y_1$, they minimize the following loss:

\[\text{loss}(r_{\theta}) = - \mathbb{E}_{(x, y_0, y_1, i) \sim D} \left[ \log \left( \sigma \left( r_{\theta}(x, y_i) - r_{\theta}(x, y_{1-i}) \right) \right) \right]\]

Intuitively, this loss encourages the reward model to give a higher score to the summary preferred by humans. The sigmoid function $\sigma$ squashes the difference in rewards (between the two summaries) to between 0.0 and 1.0. After training, they normalize the reward model's output so that the reference summaries from their dataset achieve a mean score of zero. This provides a baseline for comparing the quality of generated summaries.

A related task is opinion summarization. This is where we generate a summary that captures the key aspects and associated sentiments from a set of opinions, such as customer feedback, social media, or product reviews. We adapt the metrics of consistency and relevance as:

* Sentiment consistency: For each key aspect, does the summary accurately reflect the overall sentiment expressed? For example, if most reviews praise the battery life but criticize the camera quality, the summary should capture this.
* Aspect relevance: Does the summary cover the main topics discussed? If many reviews raise concerns about battery life and camera quality, these points should be included in the summary.

The OpinSummEval paper explored several evals and found two to be most effective: BARTScore and question-answering (QA) based evals. It uses the test set from the Yelp dataset, which contains 100 instances of (i) eight reviews of the same product/service and (ii) one human-written review summary.

BARTScore treats evaluation as a text-generation task. It uses pre-trained BART to compute the conditional probability of the summary $y$ given the reviews $x$. The score is essentially the log-likelihood of generating the summary from the reviews.

\[\text{BARTScore} = \sum_{t} \omega_t \log p(y_t \mid y_{<t}, x, \theta)\]
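To make the formula concrete, here's a minimal sketch of a BARTScore-style score under some assumptions: it uses Hugging Face's transformers with the facebook/bart-large-cnn checkpoint as a stand-in for the official BARTScore weights, uniform token weights $\omega_t$, and the mean (rather than sum) of token log-likelihoods, which is what the seq2seq loss gives us directly.

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

MODEL_NAME = "facebook/bart-large-cnn"  # stand-in; official BARTScore uses its own finetuned BART
tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME).eval()


def bart_score(source: str, summary: str) -> float:
    """Mean per-token log-likelihood of the summary given the source (uniform weights)."""
    src = tokenizer(source, return_tensors="pt", truncation=True)
    tgt = tokenizer(summary, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # The seq2seq loss is the mean negative log-likelihood of the target tokens,
        # i.e., -1/T * sum_t log p(y_t | y_<t, x), so negating it gives the score
        loss = model(input_ids=src.input_ids,
                     attention_mask=src.attention_mask,
                     labels=tgt.input_ids).loss
    return -loss.item()


reviews = "The battery life is amazing and lasts all day. The camera, however, is mediocre in low light."
summary = "Reviewers praise the battery life but find the camera underwhelming."
print(f"BARTScore (mean log-likelihood): {bart_score(reviews, summary):.3f}")
```

Higher (less negative) scores indicate summaries the model finds more likely given the source reviews.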