[HN Gopher] FrontierMath: A benchmark for evaluating advanced ma...
___________________________________________________________________
FrontierMath: A benchmark for evaluating advanced mathematical
reasoning in AI
Author : sshroot
Score : 38 points
Date : 2024-11-09 14:18 UTC (8 hours ago)
(HTM) web link (epochai.org)
| agucova wrote:
| For some context on why this is important: this benchmark was
| designed to be extremely challenging for LLMs, with problems
| requiring several hours or days of work by expert mathematicians.
| Currently, LLMs solve 2% of problems in the set (which is kept
| private to prevent contamination).
|
| They even provide a quote from Terence Tao, who helped create
| the benchmark (alongside other Fields medalists and IMO question
| writers):
|
| > "These are extremely challenging. I think that in the near term
| basically the only way to solve them, short of having a real
| domain expert in the area, is by a combination of a semi-expert
| like a graduate student in a related field, maybe paired with
| some combination of a modern AI and lots of other algebra
| packages..."
|
| Surprisingly, prediction markets [1] are putting 62% on AI
| achieving > 85% performance on the benchmark before 2028.
|
| [1]: https://manifold.markets/MatthewBarnett/will-an-ai-achieve-8...
| llm_trw wrote:
| These benchmarks are entirely pointless.
|
| The people making them are specialists attempting to apply
| their skills to areas unrelated to LLM performance, a bit like
| a sprinter making a training regimen for a fighter jet.
|
| What matters is the structure that underlies the problem
| space: graph traversal. First, finding a path between two
| nodes; second, identifying the most efficient path; and third,
| deriving implicit nodes and edges based on a set of rules.
|
| Currently, all LLMs are so limited that they struggle with
| journeys longer than four edges, even when given a full
| itinerary of all edges in the graph. Until they can
| consistently manage a number of steps greater than what is
| contained in any math proof in the training data, they aren't
| genuinely solving these problems; they're merely regurgitating
| memorized information.
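For readers who want the three graph tasks named in the comment above made concrete, here is a minimal Python sketch. It is not the commenter's test setup or any benchmark's method. The function names (`find_path`, `cheapest_path`, `derive_edges`), the toy graph, and the choice of transitivity as the "set of rules" are all assumptions made for illustration.

```python
from collections import deque
import heapq


def find_path(graph, start, goal):
    """Task 1: breadth-first search. Returns a path with the fewest edges, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, {}):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def cheapest_path(graph, start, goal):
    """Task 2: Dijkstra's algorithm. Returns (total_cost, path), or None."""
    heap = [(0, start, [start])]
    settled = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(heap, (cost + weight, neighbor, path + [neighbor]))
    return None


def derive_edges(edges):
    """Task 3: derive implicit edges from a rule.

    Assumed rule (illustrative only): transitivity, i.e. a->b and b->c imply a->c.
    The rule is applied repeatedly until no new edge appears.
    """
    closed = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed


# Toy weighted graph (hypothetical): the direct A->E edge is short but expensive.
graph = {
    "A": {"B": 1, "E": 10},
    "B": {"C": 1},
    "C": {"D": 1},
    "D": {"E": 1},
    "E": {},
}

print(find_path(graph, "A", "E"))      # fewest edges
print(cheapest_path(graph, "A", "E"))  # lowest total weight
print(derive_edges({("A", "B"), ("B", "C")}))
```

On this toy graph the two searches disagree: the fewest-edge route takes the direct A->E edge, while the cheapest route is the four-edge journey through B, C and D, which happens to be the "four edges" length the comment mentions.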
| dr_dshiv wrote:
| > they're merely regurgitating memorized information
|
| Source?
___________________________________________________________________
(page generated 2024-11-09 23:00 UTC)