[HN Gopher] Evaluating publicly available LLMs on IMO 2025
       ___________________________________________________________________
        
       Evaluating publicly available LLMs on IMO 2025
        
       Author : hardmaru
       Score  : 67 points
       Date   : 2025-07-19 14:23 UTC (8 hours ago)
        
 (HTM) web link (matharena.ai)
 (TXT) w3m dump (matharena.ai)
        
       | blendergeek wrote:
       | Related: https://news.ycombinator.com/item?id=44613840
        
         | untitled2 wrote:
         | Exactly. Whom to believe?
        
           | changoplatanero wrote:
           | Both are true. One spent $400 in compute and the other one
           | spent a lot more.
        
             | masterjack wrote:
             | Exactly. And presumably had a more sophisticated harness
             | around the model, longer reasoning chains, best of N, self
             | judging, etc
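              | 
              | E.g., a minimal best-of-N sketch with the model judging
              | its own samples (generate and judge here are hypothetical
              | stand-ins for real model calls, not anything OpenAI has
              | described):
              | 
              |     def generate(problem: str) -> str:
              |         """Sample one candidate solution (stub)."""
              |         raise NotImplementedError
              | 
              |     def judge(problem: str, a: str, b: str) -> str:
              |         """Return whichever of a, b the model prefers."""
              |         raise NotImplementedError
              | 
              |     def best_of_n(problem: str, n: int = 32) -> str:
              |         # Sample n candidates, keeping a running winner.
              |         best = generate(problem)
              |         for _ in range(n - 1):
              |             best = judge(problem, best, generate(problem))
              |         return best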
        
           | JohnKemeny wrote:
            | The last time someone claimed a medal in an olympiad like
            | this, it turned out they had manually translated the
            | problem into Lean and then run a brute-force search
            | algorithm to find a proof. For 60 hours. On a
            | supercomputer.
           | 
           | Meanwhile high schoolers get a piece of paper and 4.5 hours.
        
             | wslh wrote:
             | Even though chess is now effectively solved against human
             | players, I still remember Kasparov's suspicion that one of
             | Deep Blue's moves had a human touch. It was never proven or
              | disproven, but I trust Kasparov's deep intuition, amplified
              | by the fact that Kasparov requested access to Deep Blue's
              | logs and IBM refused to share them in full. For more
              | discussion see [1][2][3].
             | 
             | [1] https://chess.stackexchange.com/questions/9959/did-
             | deep-blue...
             | 
             | [2] https://nautil.us/why-the-chess-computer-deep-blue-
             | played-li...
             | 
             | [3] https://en.chessbase.com/post/deep-blue-s-cheating-move
        
             | throwawaymaths wrote:
              | Kinda wild that an LLM can't translate to Lean?
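              | 
              | For reference, the target is just Lean source. A toy
              | example of the kind of statement a translator would
              | produce (Lean 4 core, deliberately trivial):
              | 
              |     theorem sum_comm (a b : Nat) : a + b = b + a :=
              |       Nat.add_comm a b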
        
           | kenjackson wrote:
           | OpenAI achieved Gold on an unreleased model. GPT-5. Read the
           | tweets and they explain what they did.
        
             | idiotsecant wrote:
             | Actually, I did it a year ago but I just don't want to
             | release my model.
        
               | senkora wrote:
               | Where should I address the billion dollar check?
        
               | emp17344 wrote:
               | My buddy did it 5 years ago. You wouldn't know him, he
               | lives in Canada.
        
               | souldeux wrote:
               | my model goes to a different school
        
               | esafak wrote:
               | The dog ate mine. And the solution didn't fit in the
               | margin, anyway.
        
             | e1g wrote:
             | OpenAI explicitly said it's not GPT-5 but another
             | experimental research model
             | https://x.com/alexwei_/status/1946477756738629827?s=46
        
               | kenjackson wrote:
                | Thanks. I parsed that wrong. In either case, it's not
                | the same model MathArena used.
        
         | raincole wrote:
          | Note that these are two different things:
          | 
          | The OP claims the publicly available models all failed to get
          | Bronze.
          | 
          | OpenAI's tweet claims there is an unreleased model that can
          | get Gold.
        
           | dmitrygr wrote:
            | My (unreleased) cat did even better than the OpenAI model.
            | No, you cannot see it. Yes, you have to trust me. Now gimme
            | more money.
        
             | raincole wrote:
             | I don't know the details (of course, it's unreleased), but
              | note that MathArena evaluated the _average_ of 4 attempts,
              | and limited token usage to 64k.
              | 
              | OpenAI likely had unlimited tokens, and evaluated the
              | _best_ of N attempts.
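              | 
              | That difference alone moves the headline number a lot. A
              | toy illustration with made-up per-attempt totals:
              | 
              |     attempt_totals = [13, 9, 11, 7]  # made-up scores
              | 
              |     # MathArena-style: average over all attempts
              |     avg = sum(attempt_totals) / len(attempt_totals)  # 10.0
              | 
              |     # best-of-N-style: report only the strongest attempt
              |     best = max(attempt_totals)                       # 13
              | 
              | Same runs, very different headline numbers, and best-of-N
              | only improves as N grows.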
        
             | amelius wrote:
              | That claim is far less plausible. OpenAI could have thrown
              | more resources at the problem, and I would be surprised if
              | that didn't improve the results.
        
             | klabb3 wrote:
             | Wow, that's incredible. Cats are progressing so fast,
             | especially unreleased cats seem to be doing much better. My
             | two orange kitties aren't doing well on math problems but
             | obviously that's because I'm not prompting the right way -
             | any day now. If I ever get it to work, I'll be sure to
             | share the achievements on X, while carefully avoiding
              | explaining how I did it or providing any data that could
              | corroborate the claims.
        
           | sigmoid10 wrote:
           | I'd also be highly wary of the method they used because of
           | statements like this:
           | 
            | > we note that the vast majority of its answers simply
            | stated the final answer without additional justification
           | 
            | While the reasoning steps are obviously important for judging
            | human participants' answers, none of the current big-name
            | providers disclose their actual reasoning tokens. So unless
            | they got direct internal access to these models from the big
            | companies (which seems highly unlikely), this might be yet
            | another flawed study design (of which we have seen several
            | in recent months, even from serious parties).
        
           | bgwalter wrote:
           | The model did not fit in the margin.
           | 
            | We'll never know how many GPUs and how much other assistance
            | (like custom code paths) this model got.
        
       | chvid wrote:
       | In a few months (weeks, days - maybe it has already happened)
       | models will have much better performance on this test.
       | 
        | Not because of actual increased "intelligence" but because the
        | test will be included in the models' training data - either
        | directly, or indirectly as model developers "tune" their models
        | to give better performance on this particular attention-driving
        | test.
        
         | sorokod wrote:
         | From the post: "Evaluation began immediately after the 2025 IMO
         | problems were released to prevent contamination."
         | 
          | Does this address your concern?
        
           | os2warpman wrote:
           | What they mean is that in a couple of weeks there are going
           | to be stories titled "LLMS NOW BETTER THAN HUMANS AT 2025
           | INTERNATIONAL MATH OLYMPIAD" (stories published as thinly-
           | veiled investment solicitations) but in reality they're still
           | shitty-- they've just had the answers fed in to be spit back
           | out.
        
             | sorokod wrote:
              | Companies will game metrics whenever they have the
              | opportunity. What else is new?
        
               | esafak wrote:
               | I suppose what's new is that the models aren't as smart
               | as their companies claimed.
        
           | chvid wrote:
           | Not really.
        
         | yunwal wrote:
         | Luckily there's a new set of problems every year
        
           | chvid wrote:
            | You can really only do a fair, reproducible test if the
            | models are static, not sitting behind an API where you have
            | no idea how they are updated or continuously tweaked.
        
             | chvid wrote:
             | This particular test is heralded as some sort of
             | breakthrough and the companies in this field are raising
             | billions of dollars from investors and paying their star
             | employees tens of millions.
             | 
             | The economic incentives to tweak, tune, or cheat are
             | through the roof.
        
       | WD-42 wrote:
       | > Gemini 2.5 Pro achieved the highest score with an average of
       | 31% (13 points). While this may seem low, especially considering
       | the $400 spent on generating just 24 answers
       | 
       | What? That's some serious cash for mostly wrong answers.
        
         | john-h-k wrote:
         | The time investment a human has to make to get 31% on the IMO
         | is worth far more than $400
        
           | WD-42 wrote:
            | The human still has to put in that time. How would you know
            | which 31% is correct?
        
       | wiremine wrote:
       | How quickly we shift our expectations. If you told me 5 years ago
       | we'd have technology that can do this, I wouldn't believe you.
       | 
       | This isn't to say we shouldn't think critically about the use and
       | performance of models, but "Not Even Bronze..." turned me off to
       | this critique.
        
         | raincole wrote:
          | In 2024 AlphaProof reached Silver level, so people rightly
          | expect a lot now.
          | 
          | (It's specifically trained on formalized math problems, unlike
          | most LLMs, so it's not an apples-to-apples comparison.)
        
         | wat10000 wrote:
         | LLMs are really good with words and kind of crap at "thinking."
         | Humans are wired to see these two things as tightly connected.
         | A machine that thinks poorly and talks great is inherently
         | confusing. A lot of discussion and disputes around LLMs comes
         | down to this.
         | 
         | It wasn't that long ago that the Turing Test was seen as the
         | gold standard of whether a machine was actually intelligent.
         | LLMs blew past that benchmark a year or two ago and people
         | barely noticed. This might be moving the goalposts, but I see
         | it as a realization that thought and language are less
         | inherently connected than we thought.
         | 
         | So yeah, the fact that they even do this well is pretty
         | amazing, but they sound like they should be doing so much
         | better.
        
           | thaumasiotes wrote:
           | > LLMs are really good with words and kind of crap at
           | "thinking." Humans are wired to see these two things as
           | tightly connected. A machine that thinks poorly and talks
           | great is inherently confusing. A lot of discussion and
           | disputes around LLMs comes down to this.
           | 
           | It's not an unfamiliar phenomenon in humans. Look at Malcolm
           | Gladwell.
        
       | ipsin wrote:
       | I was hoping to see the questions (which I can probably find
       | online), but also the answers from models and the judge's scores!
       | Am I missing a link? Without that I can't tell whether I should
       | be impressed or not.
        
         | raincole wrote:
         | https://matharena.ai/
         | 
          | On their website you can see the full answers the LLMs gave
          | ("click cells to see...")
        
       | gcanyon wrote:
       | 99.99+% of all problems humans face do not require particularly
       | original solutions. Determining whether LLMs can solve truly
       | original (or at least obscure) problems is interesting, and a
       | problem worth solving, but ignores the vast majority of the
       | (near-term at least) impact they will have.
        
         | lottin wrote:
         | 15 years ago they were predicting that AI would turn everything
          | upside down in 15 years' time. It hasn't.
        
           | HEmanZ wrote:
           | People who say this don't understand the breakthrough we had
           | in the last couple of years. 15 years ago I was laughing at
           | people predicting AI would turn everything upside down soon.
           | I'm not laughing anymore. I've been around long enough to see
           | some AI hype cycles and this time it is different.
           | 
           | 15 years ago I, working on AI systems at a FAANG, would have
           | told you "real" AI probably wasn't coming in my lifetime. 15
           | years ago the only engineers I knew who thought AI was coming
            | soon were dreamers and Silicon Valley Kool-Aid drinkers. The
            | rest of us saw we needed a step-function breakthrough that
            | may not even exist. But it did, and we got there, a couple
            | of years ago.
           | 
           | Now I'm telling people it's here. We've hit a completely
           | different kind of technology, and it's so clear to people
           | working in the field. The earthquake has happened and the
           | tsunami is coming.
        
             | csa wrote:
             | Thank you for sharing your experience. It makes the impact
             | of the recent advances palpable.
        
         | wat10000 wrote:
         | I really doubt a contest for high schoolers contains any truly
         | original problems.
        
         | Barrin92 wrote:
          | The value of human beings isn't in their capacity to do routine
          | tasks but to respond with some common sense to all the critical
          | issues in the 2% at the tail.
          | 
          | This is why original problems are important: they measure how
          | sensible something is in an open-ended environment, and here
          | the models are completely useless, not just because they fail
          | but in how they fail. The fact that these LLMs, according to
          | the article, "invent non-existent math theorems" - i.e.,
          | produce gibberish instead of knowing what they don't know - is
          | an indication of how limited this still is.
        
         | wavemode wrote:
         | To be frank, I take precisely the opposite view. Most people
         | solve novel problems every day, mostly without thinking much
         | about it. Our inability to perceive the immense complexity of
         | the things we do every day is merely due to familiarity. In
         | other words we're blind to the details because our brain
         | handles them automatically, not because they don't exist.
         | 
         | Software engineers understand this better than most -
         | describing a task in general terms, and doing it yourself, can
         | be incredibly easy, even while writing the code to automate the
         | task is difficult or impossible, because of all the devilish
         | details we don't often think about.
        
       | magicalhippo wrote:
       | One interesting takeaway for me, a non-practitioner, was that the
        | models appear to be fairly decent at judging their own output.
       | 
        | They used best-of-32, with the same model judging a "tournament"
        | to find the best answer. Seems like something that could be
        | bolted on reasonably easily, e.g. in, say, WebUI.
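        | 
        | I'd guess the "tournament" is a single-elimination bracket over
        | the 32 samples, something like this sketch (judge stands in for
        | a pairwise model call):
        | 
        |     def judge(problem, a, b):
        |         """Return whichever answer the model prefers (stub)."""
        |         raise NotImplementedError
        | 
        |     def tournament(problem, candidates):
        |         # Pair answers off each round; winners advance.
        |         players = list(candidates)
        |         while len(players) > 1:
        |             winners = [judge(problem, players[i], players[i + 1])
        |                        for i in range(0, len(players) - 1, 2)]
        |             if len(players) % 2:      # odd one out gets a bye
        |                 winners.append(players[-1])
        |             players = winners
        |         return players[0]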
       | 
       | edit: forgot to add that I'm curious if this translates to
       | smaller models as well, or if it requires these huge models.
        
       | ysofunny wrote:
        | This makes me really wonder: what is the underlying practical
        | mathematical skill?
        | 
        | Intuition?
        
         | samat wrote:
          | plus a little skill
        
       | wrsh07 wrote:
       | > Each model was run with the recommended hyperparameters and a
       | maximum token limit of 64,000. No models needs more than this
       | number of tokens
       | 
       | I'm a little confused by this. My assumptions (possibly
       | incorrect!): 64k tokens per prompt, they are claiming the model
       | wouldn't need more tokens even for reasoning
       | 
       | Is that right? Would be helpful to see how many tokens the models
       | actually used.
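        | 
        | If it works the way I assume, the cap is per response and covers
        | reasoning plus the final answer. E.g. with the OpenAI Python
        | client (model name is just a placeholder):
        | 
        |     from openai import OpenAI
        | 
        |     problem_text = "..."  # an IMO problem statement
        |     client = OpenAI()
        |     resp = client.chat.completions.create(
        |         model="some-reasoning-model",  # placeholder
        |         messages=[{"role": "user", "content": problem_text}],
        |         max_completion_tokens=64_000,  # the 64k cap
        |     )
        |     # the usage stats would answer my question directly:
        |     print(resp.usage.completion_tokens)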
        
         | throwawaymaths wrote:
          | They didn't even do a (non-ML) agentic descent? Like have a
          | quick API that re-queries itself, generating new context?
          | 
          | "OK, here is my strategy, here are the five steps", then
          | re-query with a strategy or proof of step 1, 2, 3...
          | 
          | in a DFS
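          | 
          | Roughly this, I mean (propose_steps and prove are hypothetical
          | model calls, not any real API):
          | 
          |     def propose_steps(goal, context):
          |         """Ask the model to split a goal into sub-steps."""
          |         raise NotImplementedError
          | 
          |     def prove(goal, context):
          |         """Ask for a direct proof; None on failure (stub)."""
          |         raise NotImplementedError
          | 
          |     def dfs_prove(goal, context="", depth=3):
          |         direct = prove(goal, context)
          |         if direct is not None:
          |             return direct
          |         if depth == 0:
          |             return None
          |         proofs = []
          |         for step in propose_steps(goal, context):
          |             sub = dfs_prove(step, "\n".join([context] + proofs),
          |                             depth - 1)
          |             if sub is None:
          |                 return None  # a fuller version would backtrack
          |             proofs.append(sub)
          |         return "\n".join(proofs)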
        
       | akomtu wrote:
       | Easy benchmark that's hard to fake: data compression.
       | Intelligence is largely about creating compact predictive models
       | and so is data compression. The output should be a program
       | generating the sequence or the dataset, based on entry id or
       | nearby data points. Typical LLM bullshit won't work here because
       | the output isn't English prose that can fool a human.
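        | 
        | Scoring is mechanical, too. A minimal sketch, assuming each
        | submission is Python source that defines a generate() function:
        | 
        |     import zlib
        | 
        |     def score(program_source: str, dataset: bytes):
        |         """Smaller is better; None means the check failed."""
        |         env = {}
        |         exec(program_source, env)  # trusted submissions only!
        |         if env["generate"]() != dataset:
        |             return None            # must reproduce data exactly
        |         return len(program_source.encode())
        | 
        |     def baseline(dataset: bytes) -> int:
        |         # Off-the-shelf compression sets the bar to beat.
        |         return len(zlib.compress(dataset, 9))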
        
       | esjeon wrote:
       | > For Problem 5, models often identified the correct strategies
       | but failed to prove them, which is, ironically, the easier part
       | for an IMO participant. This contrast ... suggests that models
       | could improve significantly in the near future if these
       | relatively minor logical issues are addressed.
       | 
        | Interesting, but I'm not sure if this is really due to "minor
        | logical issues". This sounds like a failure due to a lack of
        | actual understanding (the world-model problem). Perhaps the
        | actual answers from the AIs might hold some hints, but I can't
        | find them.
       | 
        | (EDIT: oops, found the output on the _main_ page of their
        | website. Didn't expect that.)
       | 
       | > Best-of-n is Important ... the models are surprisingly
       | effective at identifying the relative quality of their own
       | outputs during the best-of-n selection process and are able to
       | look past coherence to check for accuracy.
       | 
       | Yes, it's always easier to be a backseat driver.
        
         | Lerc wrote:
          | > _Yes, it's always easier to be a backseat driver_
         | 
         | Any model that can identify the correct answer reliably can
         | arrive at the correct answer given enough time and
         | stochasticity.
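          | 
          | In other words, generate-and-verify reduces to a loop. A
          | sketch, assuming a reliable verifier exists (both calls are
          | hypothetical):
          | 
          |     def sample_answer(problem):
          |         """One stochastic model sample (stub)."""
          |         raise NotImplementedError
          | 
          |     def verify(problem, answer) -> bool:
          |         """Reliable correctness check (stub)."""
          |         raise NotImplementedError
          | 
          |     def solve(problem, budget=10_000):
          |         for _ in range(budget):
          |             answer = sample_answer(problem)
          |             if verify(problem, answer):
          |                 return answer
          |         return None  # budget exhausted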
        
       | daedrdev wrote:
       | Here are the IMO problems if you want to give them a try:
       | 
       | https://www.imo-official.org/year_info.aspx?year=2025 (download
       | page)
       | 
       | They are very difficult.
        
       | strangescript wrote:
       | "You know that really hard test thing that most humans on the
       | planet can't do, or even understand, yeah, LLMs kind of suck at
       | it too"
       | 
       | Meanwhile Noam "well aschtually..."
       | 
        | I love how people are still betting against AI, it's hilarious.
        | Please write more 2000s-esque "The internet is a fad" articles.
        
         | boringg wrote:
          | It's quite reasonable. We have yet to meet anything more
          | intelligent than humans, so why do we think we can create
          | something more intelligent than us when we don't fully
          | understand the complexities of how we work?
         | 
         | AI still has a long way to go, though it has proven to be a
         | useful tool at this point.
        
           | strangescript wrote:
            | Who said anything about creating something more intelligent
            | than us? These articles have the air of "why are we wasting
            | our time on this stuff", people like Gary Marcus link them,
            | and meanwhile the models get better week over week.
        
       | AndrewKemendo wrote:
        | Can someone tell me where your average everyday human that's
       | walking around and has a regular job and kids and a mortgage
       | would land on this leaderboard? That's who we should be comparing
       | against.
       | 
        | The fact that the only formal comparisons for AI systems that
        | are ever done are explicitly based on the highest-performing,
        | narrowly focused humans tells me how unprepared society is for
        | what's happening.
       | 
        | Appreciate that: at the point at which there is unambiguous
       | demonstration of superhuman level performance across all human
       | tasks by a machine, (and make no mistake, that *is the bar that
       | this blog post and every other post about AI sets*) it's
       | completely over for the human race; unless someone figures out an
       | entirely new economic system.
        
         | raincole wrote:
         | > average every day human
         | 
          | The average math major can't get Bronze.
        
         | pphysch wrote:
          | Machines have always had superhuman capabilities in narrow
          | domains. The LLM domain is quite broad, but it's still just an
          | LLM, beholden to its training.
          | 
          | The average everyday human does not have the time to read all
          | available math texts. LLMs do, but they still can't get
          | Bronze. What does that say about them?
        
         | zdragnar wrote:
         | The average person is bad at literally almost everything.
         | 
         | If I want something done, I'll seek out someone with a skill
         | set that matches the problem.
         | 
         | I don't want AI to be as good as an average person. I want AI
         | to be better than the person I would go to for help. A person
         | can talk with me, understand where I've misunderstood my own
         | problem, can point out faulty assumptions, and may even tell me
         | that the problem isn't even a problem that needs solving. A
         | person can suggest a variety of options and let me decide what
         | trade-offs I want to make.
         | 
         | If I don't trust the AI to do that, then I'm not sure why I'd
         | use it for anything other than things that don't need to be
         | done at all, unless I can justify the chance that maybe it'll
         | be done right, and I can afford the time lost getting it done
         | right without the AI afterwards.
        
           | SirFatty wrote:
           | "The average person is bad at literally almost everything."
           | 
           | Wow... that's quite a generalization. And not my experience
           | at all.
        
             | Retric wrote:
              | The average person can't play 99% of all musical
              | instruments, speak 99% of all languages, do 99% of
              | surgeries, recite 99% of all poems from memory, etc.
             | 
             | We don't ask the average person to do most things, either
             | finding a specialist or providing training beforehand.
        
               | krapp wrote:
               | One cannot be bad at the things one doesn't even do. None
               | of this demonstrates that humans are bad at "literally
               | almost everything."
        
               | Retric wrote:
               | > One cannot be bad at the things one doesn't even do.
               | 
               | ??? If you don't know how to do something you're really
               | bad at it. I'm not sure what that sentence is even trying
               | to convey.
        
               | krapp wrote:
                | > Obviously you could train someone to recite The Raven
                | from memory, but they can't do it now.
               | 
               | That doesn't make them bad at reciting The Raven from
               | memory. Being trained to recite The Raven from memory and
               | still being unable to do so would be a proper application
               | of the term. There is an obvious difference between the
               | two states of being and conflating them is specious.
               | 
               | If you want to take seriously the premise that humans are
               | bad at almost everything because most humans haven't been
               | trained at doing almost everything humans can do, then
               | you must apply the same rubric to LLMs, which are only
               | capable of expressions within their specific dataset (and
                | thus not the entire corpus of data on which they
                | _haven't_ been trained) and which even then tend to
                | confabulate far more frequently than human beings at
                | even simple tasks.
               | 
               | edit: never mind, I guess you aren't willing to take this
               | conversation on good faith.
        
               | mysterydip wrote:
               | Didn't this start with "Can someone tell me where your
               | average every day human that's walking around and has a
               | regular job and kids and a mortgage would land on this
               | leaderboard? That's who we should be comparing against."
               | 
               | And the average person would do poorly. Not because they
               | couldn't be trained to do it, but because they haven't.
        
               | krapp wrote:
                | It's obvious that the average person would do badly at the
               | International Math Olympiad. Although I don't know why
               | the qualifiers of "regular job and kids and a mortgage"
               | are necessary, except as a weird classist signifier. I
               | strongly suspect most people on HN, who consider
               | themselves set apart from the average, with some also
               | having a regular job, kids and a mortgage, would also not
               | do well at the International Math Olympiad.
               | 
               | But that isn't the claim I'm objecting to. The claim I'm
               | objecting to is "The average person is bad at literally
               | almost everything," which is not an equivalent claim to
               | "people who aren't trained at math would be bad at math
               | at a competitive level," because it implicitly includes
               | everything that a person is trained in and is expected to
               | be qualified to do.
               | 
               | It was just bad, cynical hyperbole. And it's weird that
               | people are defending it so aggressively.
        
               | Retric wrote:
               | > it implicitly includes everything that a person is
               | trained in
               | 
                | It explicitly includes everything they are trained in;
                | what's perhaps misleading is saying "average".
                | 
                | You could phrase the same idea as "every individual is
                | literally bad at almost everything", but "every
                | individual" is a confusing concept here, where "average
                | person" is the more intuitive idea.
        
               | rahimnathwani wrote:
               | It's obvious that 'bad at' in this context means
               | 'incapable of doing well'.
               | 
               | Nitpicking language doesn't help to move the
               | conversation. One thing most humans _are_ good at is
                | understanding meaning even when the speaker wasn't
                | absolutely precise.
        
               | gundmc wrote:
               | You and the parent poster seem to be conflating the ideas
               | of:
               | 
               | - Does not have the requisite skills and experiences to
               | do X successfully
               | 
               | - Inherently does not have the capacity to do X
               | 
               | I think the former is a reasonable standard to apply in
               | this context. I'd definitely say I would be bad if I
               | tried to play the guitar, but I'm not inherently
               | incapable of doing it. It's just not very useful to say
               | "I could be good at it if I put 1000 hours of practice
               | in."
        
               | zdragnar wrote:
               | That's why there's the qualifier of "average person". If
               | one learns to play the guitar well, they are no longer
               | the average person in the context of guitar playing.
        
             | rahimnathwani wrote:
             | More than 50% of people cannot write a 'hello world'
             | program in any programming language.
             | 
             | More than 50% of people employed as software engineers
             | cannot read an academic paper in a field like education,
             | and explain whether the conclusions are sound, based on the
             | experiment description and included data.
             | 
             | More than 50% of people cannot interpret an X-ray.
        
               | csa wrote:
               | > More than 50% of people employed as software engineers
               | cannot read an academic paper in a field like education,
               | and explain whether the conclusions are sound, based on
               | the experiment description and included data.
               | 
               | I know this was meant as a dig, but I'm actually guessing
               | that software engineers score higher on this task than
               | non-engineers who hold M.Ed. degrees.
        
               | rahimnathwani wrote:
                | Agreed! Probably 3% of software engineers could do it,
                | vs 1% for M.Ed. holders.
               | 
               | The only reason I chose software engineers is because I
               | was trying to show that people who can write 'hello
               | world' programs (first example) are not good at _all_
               | intellectual tasks.
        
           | AndrewKemendo wrote:
            | Which proves my point precisely: unless you're superhuman by
            | this definition, you're obsolete.
            | 
            | Nothing new really, but there's nowhere left to go for human
            | labor, and even that concept is being jeered at as a fantasy
            | despite this attitude.
        
         | baobabKoodaa wrote:
          | The average human would score exactly 0 at the IMO.
        
         | bgwalter wrote:
          | Average humans, no. Mathematicians with enough time and a
          | well-indexed database of millions of similar problems,
          | probably.
         | 
         | We don't allow chess players to access a Syzygy tablebase in a
         | tournament.
        
         | pragmatic wrote:
         | That's not how modern societies/economies work.
         | 
         | We have specialists everywhere.
        
           | AndrewKemendo wrote:
           | My literal last sentence addresses this
        
       | bgwalter wrote:
       | So the gold medal claims in
       | https://news.ycombinator.com/item?id=44613840 look exaggerated.
       | 
       | The whole competition is unfair anyway. An "AI" has access to
       | millions of similar problems stolen and encoded in the model.
        | Humans would at least need access to a similar database; think
        | of an open-database exam, a nuclear version of an open-book
        | exam.
        
       ___________________________________________________________________
       (page generated 2025-07-19 23:00 UTC)