https://www.lesswrong.com/posts/CkhJAxHeyFCg2EcET/are-language-models-good-at-making-predictions

Are language models good at making predictions?

by dynomight, 6th Nov 2023 (5 min read)

This is a linkpost for https://dynomight.net/predictions/

To get a crude answer to this question, we took 5000 questions from Manifold markets that were resolved after GPT-4's current knowledge cutoff of Jan 1, 2022. We gave the text of each of them to GPT-4, along with these instructions:

    You are an expert superforecaster, familiar with the work of Tetlock and others. For each question in the following json block, make a prediction of the probability that the question will be resolved as true. Also you must determine category of the question. Some examples include: Sports, American politics, Science etc. Use make_predictions function to record your decisions. You MUST give a probability estimate between 0 and 1 UNDER ALL CIRCUMSTANCES. If for some reason you can't answer, pick the base rate, but return a number between 0 and 1.

This produced a big table:

| question | prediction P(YES) | category | actually happened? |
|---|---|---|---|
| Will the #6 Golden State Warriors win Game 2 of the West Semifinals against the #7 LA Lakers in the 2023 NBA Playoffs? | 0.5 | Sports | YES |
| Will Destiny's main YouTube channel be banned before February 1st, 2023? | 0.4 | Social Media | NO |
| Will Qualy show up to EAG DC in full Quostume? | 0.3 | Entertainment | NO |
| Will I make it to a NYC airport by 2pm on Saturday, the 24th? | 0.5 | Travel | YES |
| Will this market have more Yes Trades then No Trades | 0.5 | Investment | CANCEL |
| Will Litecoin (LTC/USD) Close Higher July 22nd Than July 21st? | 0.5 | Finance | NO |
| Will at least 20 people come to a New Year's Resolutions live event on the Manifold Discord? | 0.4 | Social Event | YES |
| hmmmm {i} | 0.5 | Uncategorized | YES |
| Will there be multiple Masters brackets in Leagues season 4? | 0.4 | Gaming | NO |
| Will the FDA approve OTC birth control by the end of February 2023? | 0.5 | Health | NO |
| Will Max Verstappen win the 2023 Formula 1 Austrian Grand Prix? | 0.5 | Sports | YES |
| Will SBF make a tweet before Dec 31, 2022 11:59pm ET? | 0.9 | Social Media | YES |
| Will Balaji Srinivasan actually bet $1m to 1 BTC, BEFORE 90 days pass? (June 15st, 2023) | 0.3 | Finance | YES |
| Will a majority of the Bangalore LessWrong/ACX meet-up attendees on 8th Jan 2023 find the discussion useful that day? | 0.7 | Community Event | YES |
| Will Jessica-Rose Clark beat Tainara Lisboa? | 0.6 | Sports | NO |
| Will X (formerly twitter) censor any registered U.S presidential candidates before the 2024 election? | 0.4 | American Politics | CANCEL |
| test question | 0.5 | Test | YES |
| stonk | 0.5 | Test | YES |
| Will I create at least 100 additional self-described high-quality Manifold markets before June 1st 2023? | 0.8 | Personal Goal | YES |
| Will @Gabrielle promote to ??? | 0.5 | Career Advancement | NO |
| Will the Mpox (monkeypox) outbreak in the US end in February 2023? | 0.45 | Health | YES |
| Will I have taken the GWWC pledge by Jul 1st? | 0.3 | Personal | NO |
| FIFA U-20 World Cup - Will Uruguay win their semi-final against Israel? | 0.5 | Sports | YES |
| Will Manifold display the amount a market has been tipped by end of September? | 0.6 | Technology | NO |

In retrospect, maybe we should have filtered these. Many questions are a bit silly for our purposes, though they're typically classified as "Test", "Uncategorized", or "Personal".

Is this good?

One way to measure if you're good at predicting stuff is to check your calibration: When you say something has a 30% probability, does it actually happen 30% of the time?

To check this, you need to make a lot of predictions. Then you dump all your 30% predictions together, and see how many of them happened.

GPT-4 is not well-calibrated.
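The calibration check described above (pool predictions with similar probabilities, then see how often they came true) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for this post: the function name `calibration_table`, the default of 20 bins, and the toy data are all hypothetical.

```python
from collections import defaultdict

def calibration_table(probs, outcomes, n_bins=20):
    """Group predictions into equal-width probability bins and report, per bin,
    the mean prediction, the observed frequency of YES, and the count.

    probs:    predicted probabilities in [0, 1]
    outcomes: 1 if the event happened, 0 if it did not
    """
    bins = defaultdict(list)
    for p, y in zip(probs, outcomes):
        # The small epsilon guards against floating-point round-off at bin edges;
        # min() keeps p == 1.0 inside the last bin.
        idx = min(int(p * n_bins + 1e-9), n_bins - 1)
        bins[idx].append((p, y))

    table = []
    for idx in sorted(bins):
        ps, ys = zip(*bins[idx])
        table.append({
            "bin_start": idx / n_bins,
            "mean_prediction": sum(ps) / len(ps),
            "observed_frequency": sum(ys) / len(ys),
            "count": len(ps),
        })
    return table

# Toy data (made up): the forecaster says 20% for events that happen 40% of
# the time, and 80% for events that happen 75% of the time.
toy_probs = [0.2] * 5 + [0.8] * 4
toy_outcomes = [1, 1, 0, 0, 0, 1, 1, 1, 0]

for row in calibration_table(toy_probs, toy_outcomes, n_bins=10):
    print(row)
```

A well-calibrated forecaster would show `observed_frequency` close to `mean_prediction` in every bin that has enough predictions to be meaningful.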
[calibratio]

Here, the x-axis is the range of probabilities GPT-4 gave, broken down into bins of size 5%. For each bin, the green line shows how often those things actually happened. Ideally, this would match the dotted black line. For reference, the bars show how many predictions GPT-4 gave that fell into each of the bins. (The lines are labeled on the y-axis on the left, while the bars are labeled on the y-axis on the right.)

At a high level, this means that GPT-4 is over-confident. When it says something has only a 20% chance of happening, it actually happens around 35-40% of the time. When it says something has an 80% chance of happening, it only happens around 60-75% of the time.

Does it depend on the area?

We can make the same plot for each of the 16 categories. (Remember, these categories were decided by GPT-4, though from a spot-check, they look accurate.) For unclear reasons, GPT-4 is well-calibrated for questions on sports, but horrendously calibrated for "personal" questions:

[calibratio]

All the lines look a bit noisy since there are 20 x 4 x 4 = 320 total bins and only 5000 total observations.

Is there more to life than calibration?

Say you and I are predicting whether a fair coin comes up heads when flipped. I always predict 50%, while you always predict either 0% or 100% and you're always right. Then we are both perfectly calibrated. But clearly your predictions are better, because you predicted with more confidence.

The typical way to deal with this is squared errors, or "Brier scores". To calculate this, let the actual outcome be 1 if the thing happened, and 0 if it didn't. Then take the average squared difference between your probability and the actual outcome. For example:

* GPT-4 gave "Will SBF make a tweet before Dec 31, 2022 11:59pm ET?" a YES probability of 0.9. Since this actually happened, this corresponds to a score of (0.9 - 1)^2 = 0.01.
* GPT-4 gave "Will Manifold display the amount a market has been tipped by end of September?" a YES probability of 0.6. Since this didn't happen, this corresponds to a score of (0.6 - 0)^2 = 0.36.

Here are the average scores for each category (lower is better):

[brier]

Or, if you want, you can decompose the Brier score. There are various ways to do this, but my favorite is Brier = Calibration + Refinement. Informally, Calibration is how close the green lines above are to the dotted black lines, while Refinement is how confident you were. (Both are better when smaller.)

[decomp]

You can also visualize this as a scatterplot:

[decomp-sca]

Is there more to life than refinement?

Brier scores are better for politics questions than for science questions. But is that because GPT-4 is bad at science, or just because science questions are hard?

There's a way to further decompose the Brier score. You can break up the refinement as Refinement = Uncertainty - Resolution. Roughly speaking, Uncertainty is "how hard questions are", while Resolution is "how confident you were, once calibration and uncertainty are accounted for".

Here's the uncertainty for different categories:

[uncertaint]

And here's a scatterplot of the calibration and resolution for each category: (Since more resolution is better, it's now the upper-left that contains better predictions.)

[82549bee82]

Overall, this further decomposition doesn't change much. This suggests GPT-4 really is better at making predictions for politics than for science or technology, even once the hardness of the questions is accounted for.

P.S. The relative merits of different Brier score decompositions caused an amazing amount of internal strife during the making of this post. I had no idea I could feel so strongly about mundane technical choices. I guess I now have an exciting new category of enemies.
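For readers who want to try the scoring on their own predictions, here is a minimal Python sketch of the Brier score and the two decompositions discussed in this post (Brier = Calibration + Refinement, and Refinement = Uncertainty - Resolution). It is not the code behind the plots. It assumes the standard Murphy-style decomposition and groups predictions by their exact probability value rather than by 5% bins, so the identities hold exactly; with binned predictions they hold only approximately. The function names are hypothetical.

```python
from collections import defaultdict

def brier_score(probs, outcomes):
    """Average squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def brier_decomposition(probs, outcomes):
    """Split the Brier score into calibration, refinement, uncertainty, resolution.

    Assumption: predictions are grouped by exact probability value.
    """
    n = len(probs)
    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)

    base_rate = sum(outcomes) / n  # overall fraction of YES outcomes
    calibration = refinement = resolution = 0.0
    for p, ys in groups.items():
        weight = len(ys) / n       # share of all predictions in this group
        freq = sum(ys) / len(ys)   # how often this group's events happened
        calibration += weight * (p - freq) ** 2
        refinement += weight * freq * (1 - freq)
        resolution += weight * (freq - base_rate) ** 2

    return {
        "calibration": calibration,
        "refinement": refinement,
        "uncertainty": base_rate * (1 - base_rate),
        "resolution": resolution,
    }

# The two examples from the post: 0.9 on a YES outcome, 0.6 on a NO outcome.
print(brier_score([0.9, 0.6], [1, 0]))

# Toy data (made up) to check the identities numerically.
toy_probs = [0.2] * 5 + [0.8] * 4
toy_outcomes = [1, 1, 0, 0, 0, 1, 1, 1, 0]
parts = brier_decomposition(toy_probs, toy_outcomes)
print(brier_score(toy_probs, toy_outcomes), parts)
```

On the toy data, `calibration + refinement` should match the Brier score and `uncertainty - resolution` should match `refinement`, up to floating-point error.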
9 comments

Tao Lin (1d):
Chat or instruction finetuned models have poor prediction calibration, whereas base models (in some cases) have perfect calibration. Also forecasting is just hard. So I'd expect chat models to ~always fail, base models to fail slightly less, but I'd expect finetuned models (on a somewhat large dataset) to be somewhat useful.

dynomight (1d):
> Chat or instruction finetuned models have poor prediction calibration, whereas base models (in some cases) have perfect calibration.
Tell me if I understand the idea correctly: Log-loss to predict next token leads to good calibration for single token prediction, which manifests as good calibration for percentage predictions? But then RLHF is some crazy loss totally removed from calibration that destroys all that?
If I get that right, it seems quite intuitive. Do you have any citations, though?

ReaderM (1d):
https://imgur.com/a/3gYel9r
https://openai.com/research/gpt-4

Sune (1d):
I don't find it intuitive at all. It would be intuitive if you started by telling a story describing the situation and asked the LLM to continue the story, and you then sampled randomly from the continuations and counted how many of the continuations would lead to a positive resolution of the question. This should be well-calibrated (assuming the details included in the prompt were representative and that there isn't a bias in which types of endings the stories have in the training data for the LLM). But this is not what is happening. Instead the model outputs a token which is a number, and somehow that number happens to be well-calibrated. I guess that should mean that the predictions made in the training data are well-calibrated? That just seems very unlikely.

justinpombrio (6h):
Yeah, exactly. For example, if humans had a convention of rounding probabilities to the nearest 10% when writing them, then baseline GPT-4 would follow that convention and it would put a cap on the maximum calibration it could achieve. Humans are badly calibrated (right?) and baseline GPT-4 is mimicking humans, so why is it well calibrated? It doesn't follow from its token stream being well calibrated relative to text.

dschwarz (1d):
Great post!
> Manifold markets that were resolved after GPT-4's current knowledge cutoff of Jan 1, 2022
Were you able to verify that newer knowledge didn't bleed in? Anecdotally GPT-4 can report various different cutoff dates, depending on the API. And there is anecdotal evidence that GPT-4-0314 occasionally knows about major world events after its training window, presumably from RLHF?
This could explain the better scores on politics than science.

dynomight (1d):
Sadly, no--we had no way to verify that.
I guess one way you might try to confirm/refute the idea of data leakage would be to look at the decomposition of Brier scores: GPT-4 is much better calibrated for politics vs. science but only very slightly better at politics vs. science in terms of refinement/resolution. Intuitively, I'd expect data leakage to manifest as better refinement/resolution rather than better calibration.

tenthkrige (9h):
Very interesting! From eyeballing the graphs, it looks like the average Brier score is barely below 0.25. This indicates that GPT-4 is better than a dart-throwing monkey (i.e. predicting a random %age, score of 0.33), and barely better than chance (always predicting 50%, score of 0.25).
It would be interesting to see the decompositions for those two naive strategies for that set of questions, and compare to the sub-scores GPT-4 got.
You could also check if GPT-4 is significantly better than chance.

cartgr (1d):
Why didn't you look at the token probabilities for the "yes" and "no" tokens?