[HN Gopher] FrontierMath: A benchmark for evaluating advanced ma...
       ___________________________________________________________________
        
       FrontierMath: A benchmark for evaluating advanced mathematical
       reasoning in AI
        
       Author : sshroot
       Score  : 38 points
       Date   : 2024-11-09 14:18 UTC (8 hours ago)
        
 (HTM) web link (epochai.org)
 (TXT) w3m dump (epochai.org)
        
       | agucova wrote:
       | For some context on why this is important: this benchmark was
       | designed to be extremely challenging for LLMs, with problems
       | requiring several hours or days of work by expert mathematicians.
       | Currently, LLMs solve 2% of problems in the set (which is kept
       | private to prevent contamination).
       | 
       | They even provide a quote from Terence Tao, who helped create
       | the benchmark (alongside other Fields medalists and IMO question
       | writers):
       | 
       | > "These are extremely challenging. I think that in the near term
       | basically the only way to solve them, short of having a real
       | domain expert in the area, is by a combination of a semi-expert
       | like a graduate student in a related field, maybe paired with
       | some combination of a modern AI and lots of other algebra
       | packages..."
       | 
       | Surprisingly, prediction markets [1] are putting 62% on AI
       | achieving > 85% performance on the benchmark before 2028.
       | 
       | [1]: https://manifold.markets/MatthewBarnett/will-an-ai-
       | achieve-8...
        
         | llm_trw wrote:
         | These benchmarks are entirely pointless.
         | 
         | The people making them are specialists attempting to apply
         | their skills to areas unrelated to LLM performance, a bit like
         | a sprinter making a training regimen for a fighter jet.
         | 
         | What matters is the data structures that underlie the problem
         | space - graph traversal. First, finding a path between two
         | nodes; second, identifying the most efficient path; and third,
         | deriving implicit nodes and edges based on a set of rules.
         | 
         | Currently, all LLMs are so limited that they struggle with
         | journeys longer than four edges, even when given a full
         | itinerary of all edges in the graph. Until they can
         | consistently manage a number of steps greater than what is
         | contained in any math proof in the training data, they aren't
         | genuinely solving these problems; they're merely regurgitating
         | memorized information.
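
For readers unfamiliar with the three graph tasks named above (finding a path, finding the most efficient path, and deriving implicit edges from a rule), here is a minimal Python sketch. It is not the commenter's code or any benchmark's method. The toy graph `GRAPH`, the function names, and the choice of transitivity as the example rule are all assumptions made for illustration.

```python
from collections import deque
import heapq

# Hypothetical toy graph: node -> {neighbour: edge cost}.
GRAPH = {
    "A": {"B": 1, "C": 5},
    "B": {"C": 1, "D": 4},
    "C": {"D": 1},
    "D": {"E": 1},
    "E": {},
}

def find_path(graph, start, goal):
    """Task 1: breadth-first search; returns some path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], {}):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def cheapest_path(graph, start, goal):
    """Task 2: Dijkstra's algorithm; returns (cost, path) or None."""
    heap = [(0, start, [start])]
    settled = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, weight in graph.get(node, {}).items():
            heapq.heappush(heap, (cost + weight, nxt, path + [nxt]))
    return None

def derive_edges(graph):
    """Task 3: apply the assumed rule 'a->b and b->c implies a->c'
    until nothing new appears; returns only the newly derived edges."""
    direct = {(a, b) for a in graph for b in graph[a]}
    closed = set(direct)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed - direct

if __name__ == "__main__":
    print(find_path(GRAPH, "A", "E"))
    print(cheapest_path(GRAPH, "A", "E"))
    print(sorted(derive_edges(GRAPH)))
```

In this toy graph the cheapest route from A to E uses four edges, which is the journey length the comment cites as the point where models begin to struggle.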
        
           | dr_dshiv wrote:
           | > they're merely regurgitating memorized information
           | 
           | Source?
        
       ___________________________________________________________________
       (page generated 2024-11-09 23:00 UTC)