https://www.lesswrong.com/posts/CkhJAxHeyFCg2EcET/are-language-models-good-at-making-predictions

Are language models good at making predictions?

by dynomight
5 min read · 6th Nov 2023 · 9 comments

AI 
Frontpage
This is a linkpost for https://dynomight.net/predictions/

To get a crude answer to this question, we took 5000 questions from
Manifold markets that were resolved after GPT-4's current knowledge
cutoff of Jan 1, 2022. We gave the text of each of them to GPT-4,
along with these instructions:

    You are an expert superforecaster, familiar with the work of
    Tetlock and others. For each question in the following json
    block, make a prediction of the probability that the question
    will be resolved as true.

    Also you must determine category of the question. Some examples
    include: Sports, American politics, Science etc. Use
    make_predictions function to record your decisions. You MUST give
    a probability estimate between 0 and 1 UNDER ALL CIRCUMSTANCES.
    If for some reason you can't answer, pick the base rate, but
    return a number between 0 and 1.

This produced a big table:

Question                            P(YES)  Category            Actually happened?
Will the #6 Golden State Warriors   0.5     Sports              YES
  win Game 2 of the West Semifinals
  against the #7 LA Lakers in the
  2023 NBA Playoffs?
Will Destiny's main YouTube         0.4     Social Media        NO
  channel be banned before February
  1st, 2023?
Will Qualy show up to EAG DC in     0.3     Entertainment       NO
  full Quostume?
Will I make it to a NYC airport by  0.5     Travel              YES
  2pm on Saturday, the 24th?
Will this market have more Yes      0.5     Investment          CANCEL
  Trades then No Trades
Will Litecoin (LTC/USD) Close       0.5     Finance             NO
  Higher July 22nd Than July 21st?
Will at least 20 people come to a   0.4     Social Event        YES
  New Year's Resolutions live event
  on the Manifold Discord?
hmmmm {i}                           0.5     Uncategorized       YES
Will there be multiple Masters      0.4     Gaming              NO
  brackets in Leagues season 4?
Will the FDA approve OTC birth      0.5     Health              NO
  control by the end of February
  2023?
Will Max Verstappen win the 2023    0.5     Sports              YES
  Formula 1 Austrian Grand Prix?
Will SBF make a tweet before Dec    0.9     Social Media        YES
  31, 2022 11:59pm ET?
Will Balaji Srinivasan actually     0.3     Finance             YES
  bet $1m to 1 BTC, BEFORE 90 days
  pass? (June 15st, 2023)
Will a majority of the Bangalore    0.7     Community Event     YES
  LessWrong/ACX meet-up attendees on
  8th Jan 2023 find the discussion
  useful that day?
Will Jessica-Rose Clark beat        0.6     Sports              NO
  Tainara Lisboa?
Will X (formerly twitter) censor    0.4     American Politics   CANCEL
  any registered U.S presidential
  candidates before the 2024
  election?
test question                       0.5     Test                YES
stonk                               0.5     Test                YES
Will I create at least 100          0.8     Personal Goal       YES
  additional self-described
  high-quality Manifold markets
  before June 1st 2023?
Will @Gabrielle promote to ???      0.5     Career Advancement  NO
Will the Mpox (monkeypox) outbreak  0.45    Health              YES
  in the US end in February 2023?
Will I have taken the GWWC pledge   0.3     Personal            NO
  by Jul 1st?
FIFA U-20 World Cup - Will Uruguay  0.5     Sports              YES
  win their semi-final against
  Israel?
Will Manifold display the amount a  0.6     Technology          NO
  market has been tipped by end of
  September?

In retrospect, maybe we should have filtered these. Many questions
are a bit silly for our purposes, though they're typically classified
as "Test", "Uncategorized", or "Personal".

Is this good?

One way to measure if you're good at predicting stuff is to check
your calibration: When you say something has a 30% probability, does
it actually happen 30% of the time?

To check this, you need to make a lot of predictions. Then you dump
all your 30% predictions together, and see how many of them happened.

GPT-4 is not well-calibrated.

[calibratio]

Here, the x-axis is the range of probabilities GPT-4 gave, broken
down into bins of size 5%. For each bin, the green line shows how
often those things actually happened. Ideally, this would match the
dotted black line. For reference, the bars show how many predictions
GPT-4 gave that fell into each of the bins. (The lines are labeled on
the y-axis on the left, while the bars are labeled on the y-axis on
the right.)
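
The binning behind a plot like this can be sketched in a few lines of
Python. This is a minimal sketch, not the code used for the post: the
function name `calibration_table` is hypothetical, and it assumes
outcomes are coded as 1 (YES) or 0 (NO), with cancelled markets
dropped beforehand.

```python
from collections import defaultdict


def calibration_table(predictions, outcomes, bin_width=0.05):
    """Bin predictions and compare them with what actually happened.

    Returns one tuple per non-empty bin:
    (bin lower edge, mean prediction, observed YES frequency, count).
    Assumes outcomes are 1 for YES and 0 for NO.
    """
    n_bins = round(1 / bin_width)
    bins = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        # The small epsilon guards against float error at bin edges;
        # min() keeps p == 1.0 inside the last bin.
        index = min(int(p * n_bins + 1e-9), n_bins - 1)
        bins[index].append((p, y))

    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_prediction = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        table.append((index / n_bins, mean_prediction, observed, len(pairs)))
    return table


# Toy data: four 20% predictions (one came true), two 80% predictions
# (one came true).
for row in calibration_table([0.2, 0.2, 0.2, 0.2, 0.8, 0.8],
                             [1, 0, 0, 0, 1, 0]):
    print(row)
```

For a well-calibrated forecaster, the second and third entries of
each row would roughly agree once the bins hold enough predictions.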

At a high level, this means that GPT-4 is over-confident. When it
says something has only a 20% chance of happening, it actually
happens around 35-40% of the time. When it says something has an 80%
chance of happening, it only happens around 60-75% of the time.

Does it depend on the area?

We can make the same plot for each of the 16 categories. (Remember,
these categories were decided by GPT-4, though from a spot-check,
they look accurate.) For unclear reasons, GPT-4 is well-calibrated
for questions on sports, but horrendously calibrated for "personal"
questions:

[calibratio]

All the lines look a bit noisy since there are 20 x 4 x 4 = 320 total
bins and only 5000 total observations, or about 16 per bin on
average.

Is there more to life than calibration?

Say you and I are predicting whether a fair coin comes up heads when
flipped. I always predict 50%, while you always predict either 0% or
100% and you're always right. Then we are both perfectly calibrated.
But clearly your predictions are better, because you predicted with
more confidence.

The typical way to deal with this is squared errors, or "Brier
scores". To calculate this, let the actual outcome be 1 if the thing
happened, and 0 if it didn't. Then take the average squared
difference between your probability and the actual outcome. For
example:

  * GPT-4 gave "Will SBF make a tweet before Dec 31, 2022 11:59pm
    ET?" a YES probability of 0.9. Since this actually happened, this
    corresponds to a score of (0.9-1)^2 = 0.01.
  * GPT-4 gave "Will Manifold display the amount a market has been
    tipped by end of September?" a YES probability of 0.6. Since this
    didn't happen, this corresponds to a score of (0.6-0)^2 = 0.36.
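
The same arithmetic in Python, as a minimal sketch. The function name
`brier_score` and the toy coin flips are illustrative assumptions,
not code or data from the post.

```python
def brier_score(predictions, outcomes):
    """Average squared difference between probability and outcome.

    Outcomes are 1 if the thing happened and 0 if it didn't.
    Lower is better.
    """
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must have equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# The two questions above: roughly 0.01 and 0.36 on their own.
sbf_tweet = brier_score([0.9], [1])
manifold_tips = brier_score([0.6], [0])
both = brier_score([0.9, 0.6], [1, 0])

# The coin example: always saying 50% versus always being right.
flips = [1, 0, 1, 1, 0, 0]  # made-up outcomes, 1 = heads
always_half = brier_score([0.5] * len(flips), flips)
always_right = brier_score([float(y) for y in flips], flips)

print(sbf_tweet, manifold_tips, both, always_half, always_right)
```

Both coin forecasters are perfectly calibrated, but the Brier score
separates them: always predicting 50% scores 0.25, while always being
right scores 0.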

Here are the average scores for each category (lower is better):

[brier]

Or, if you want, you can decompose the Brier score. There are various
ways to do this, but my favorite is Brier = Calibration + Refinement.
Informally, Calibration is how close the green lines above are to the
dotted black lines, while Refinement is how confident you were. (Both
are better when smaller.)
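
One standard way to write this decomposition is sketched below. The
notation is an assumption, not taken from the post: the N predictions
are grouped into K bins, bin k holds n_k predictions that all equal
f_k, and \bar{o}_k is the fraction of those that actually happened.

```latex
\[
\text{Brier}
  = \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(f_k - \bar{o}_k)^2}_{\text{Calibration}}
  + \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,\bar{o}_k\,(1 - \bar{o}_k)}_{\text{Refinement}}
\]
```

The identity is exact when every prediction in a bin has the same
value. With wider bins it holds only approximately.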

[decomp]

You can also visualize this as a scatterplot:

[decomp-sca]

Is there more to life than refinement?

Brier scores are better for politics questions than for science
questions. But is that because GPT-4 is bad at science, or just
because science questions are hard?

There's a way to further decompose the Brier score. You can break up
the refinement as Refinement = Uncertainty - Resolution. Roughly
speaking, Uncertainty is "how hard questions are", while Resolution
is "how confident you were, once calibration and uncertainty are
accounted for".
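
Here is a minimal Python sketch of the full three-way split, using
the standard textbook definitions (which may differ in detail from
what was used for the plots). The function name
`brier_decomposition` is hypothetical, and predictions are grouped by
exact value so that the identity holds exactly.

```python
from collections import defaultdict


def brier_decomposition(predictions, outcomes):
    """Return (calibration, uncertainty, resolution).

    Predictions are grouped by exact value, so that
    Brier = calibration + uncertainty - resolution holds exactly,
    with refinement = uncertainty - resolution.
    """
    n = len(predictions)
    groups = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        groups[p].append(y)

    # Uncertainty depends only on the questions, not the forecaster.
    base_rate = sum(outcomes) / n
    uncertainty = base_rate * (1 - base_rate)

    calibration = 0.0
    resolution = 0.0
    for p, ys in groups.items():
        frequency = sum(ys) / len(ys)
        calibration += len(ys) * (p - frequency) ** 2
        resolution += len(ys) * (frequency - base_rate) ** 2
    return calibration / n, uncertainty, resolution / n


# Toy data: 20% predictions come true 1 time in 4, 80% predictions
# come true 3 times in 4.
preds = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
actual = [0, 0, 0, 1, 1, 1, 1, 0]

calibration, uncertainty, resolution = brier_decomposition(preds, actual)
refinement = uncertainty - resolution
brier = sum((p - y) ** 2 for p, y in zip(preds, actual)) / len(preds)
print(calibration, uncertainty, resolution, refinement, brier)
```

Since Uncertainty is computed from the outcomes alone, two
forecasters answering the same questions can only differ in
Calibration and Resolution.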

Here's the uncertainty for different categories:

[uncertaint]

And here's a scatterplot of the calibration and resolution for each
category: (Since more resolution is better, it's now the upper-left
that contains better predictions.)

[82549bee82]

Overall, this further decomposition doesn't change much. This
suggests GPT-4 really is better at making predictions for politics
than for science or technology, even once the hardness of the
questions is accounted for.

P.S. The relative merits of different Brier score decompositions
caused an amazing amount of internal strife during the making of this
post. I had no idea I could feel so strongly about mundane technical
choices. I guess I now have an exciting new category of enemies.

9 comments, sorted by top scoring
Tao Lin, 1d

Chat or instruction finetuned models have poor prediction
calibration, whereas base models (in some cases) have perfect
calibration. Also forecasting is just hard. So I'd expect chat models
to ~always fail, base models to fail slightly less, but I'd expect
finetuned models (on a somewhat large dataset) to be somewhat useful.

dynomight, 1d

    Chat or instruction finetuned models have poor prediction
    calibration, whereas base models (in some cases) have perfect
    calibration.

Tell me if I understand the idea correctly: Log-loss to predict the
next token leads to good calibration for single-token prediction,
which manifests as good calibration for percentage predictions? But
then RLHF is some crazy loss totally removed from calibration that
destroys all that?

If I get that right, it seems quite intuitive. Do you have any
citations, though?

ReaderM, 1d

https://imgur.com/a/3gYel9r

https://openai.com/research/gpt-4

Sune, 1d

I don't find it intuitive at all. It would be intuitive if you
started by telling a story describing the situation and asked the LLM
to continue the story, and you then sampled randomly from the
continuations and counted how many of the continuations would lead to
a positive resolution of the question. This should be
well-calibrated (assuming the details included in the prompt were
representative and that there isn't a bias in which types of endings
the stories in the LLM's training data have). But this is not what is
happening. Instead the model outputs a token which is a number, and
somehow that number happens to be well-calibrated. I guess that
should mean that the predictions made in the training data are
well-calibrated? That just seems very unlikely.

justinpombrio, 6h

Yeah, exactly. For example, if humans had a convention of rounding
probabilities to the nearest 10% when writing them, then baseline
GPT-4 would follow that convention and it would put a cap on the
maximum calibration it could achieve. Humans are badly calibrated
(right?) and baseline GPT-4 is mimicking humans, so why is it well
calibrated? It doesn't follow from its token stream being well
calibrated relative to text.

dschwarz, 1d

Great post!

    Manifold markets that were resolved after GPT-4's current
    knowledge cutoff of Jan 1, 2022

Were you able to verify that newer knowledge didn't bleed in?
Anecdotally GPT-4 can report various different cutoff dates,
depending on the API. And there is anecdotal evidence that GPT-4-0314
occasionally knows about major world events after its training
window, presumably from RLHF?

This could explain the better scores on politics than science.

dynomight, 1d

Sadly, no--we had no way to verify that.

I guess one way you might try to confirm/refute the idea of data
leakage would be to look at the decomposition of Brier scores: GPT-4
is much better calibrated for politics vs. science but only very
slightly better at politics vs. science in terms of
refinement/resolution. Intuitively, I'd expect data leakage to
manifest as better refinement/resolution rather than better
calibration.

tenthkrige, 9h

Very interesting!

From eyeballing the graphs, it looks like the average Brier score is
barely below 0.25. This indicates that GPT-4 is better than a
dart-throwing monkey (i.e. predicting a random %age, score of 0.33),
and barely better than chance (always predicting 50%, score of 0.25).

It would be interesting to see the decompositions for those two naive
strategies for that set of questions, and compare to the sub-scores
GPT-4 got.

You could also check if GPT-4 is significantly better than chance.

cartgr, 1d

Why didn't you look at the token probabilities for the "yes" and "no"
tokens?
