[HN Gopher] Evaluating publicly available LLMs on IMO 2025
___________________________________________________________________
Evaluating publicly available LLMs on IMO 2025
Author : hardmaru
Score : 67 points
Date : 2025-07-19 14:23 UTC (8 hours ago)
(HTM) web link (matharena.ai)
(TXT) w3m dump (matharena.ai)
| blendergeek wrote:
| Related: https://news.ycombinator.com/item?id=44613840
| untitled2 wrote:
| Exactly. Whom to believe?
| changoplatanero wrote:
| Both are true. One spent $400 in compute and the other one
| spent a lot more.
| masterjack wrote:
| Exactly. And presumably had a more sophisticated harness
| around the model, longer reasoning chains, best of N, self
| judging, etc
| JohnKemeny wrote:
| The last time someone claimed a medal in an olympiad like
| this, it turned out they had manually translated the problem
| into Lean and then run a brute-force search algorithm to find
| a proof. For 60 hours. On a supercomputer.
|
| Meanwhile high schoolers get a piece of paper and 4.5 hours.
| wslh wrote:
| Even though chess is now effectively solved against human
| players, I still remember Kasparov's suspicion that one of
| Deep Blue's moves had a human touch. It was never proven or
| disproven, but I trust Kasparov's deep intuition, amplified by
| the fact that he requested access to Deep Blue's logs and IBM
| refused to share them in full. For more discussion see
| [1][2][3].
|
| [1] https://chess.stackexchange.com/questions/9959/did-
| deep-blue...
|
| [2] https://nautil.us/why-the-chess-computer-deep-blue-
| played-li...
|
| [3] https://en.chessbase.com/post/deep-blue-s-cheating-move
| throwawaymaths wrote:
| kinda wild that an LLM can't translate to Lean?
| kenjackson wrote:
| OpenAI achieved Gold with an unreleased model. GPT-5. Read
| the tweets; they explain what they did.
| idiotsecant wrote:
| Actually, I did it a year ago but I just don't want to
| release my model.
| senkora wrote:
| Where should I address the billion dollar check?
| emp17344 wrote:
| My buddy did it 5 years ago. You wouldn't know him, he
| lives in Canada.
| souldeux wrote:
| my model goes to a different school
| esafak wrote:
| The dog ate mine. And the solution didn't fit in the
| margin, anyway.
| e1g wrote:
| OpenAI explicitly said it's not GPT-5 but another
| experimental research model
| https://x.com/alexwei_/status/1946477756738629827?s=46
| kenjackson wrote:
| Thanks. I parsed that wrong. Either way, it's not the same
| thing MathArena used.
| raincole wrote:
| Note that these are two different claims:
|
| The OP says the publicly available models all failed to get
| Bronze.
|
| The OpenAI tweet says there is an unreleased model that can
| get Gold.
| dmitrygr wrote:
| My (unreleased) cat did even better than the OpenAI model. No
| you cannot see. Yes you have to trust me. Now gimme more
| money.
| raincole wrote:
| I don't know the details (of course, it's unreleased), but
| note that MathArena evaluated the "_average_ of 4 attempts"
| and limited token usage to 64k.
|
| OpenAI likely had unlimited tokens, and evaluated the "_best_
| of N attempts."
| amelius wrote:
| That's a claim that is far less plausible. OpenAI could
| have thrown more resources at the problem and I would be
| surprised if that didn't improve the results.
| klabb3 wrote:
| Wow, that's incredible. Cats are progressing so fast,
| especially unreleased cats seem to be doing much better. My
| two orange kitties aren't doing well on math problems but
| obviously that's because I'm not prompting the right way -
| any day now. If I ever get it to work, I'll be sure to
| share the achievements on X, while carefully avoiding
| explaining how I did it or providing any data that could
| corroborate the claims.
| sigmoid10 wrote:
| I'd also be highly wary of the method they used because of
| statements like this:
|
| >we note that the vast majority of its answers simply stated
| the final answer without additional justification
|
| While the reasoning steps are obviously important for judging
| human participants' answers, none of the current big-name
| providers disclose their actual reasoning tokens. So unless
| they got direct internal access to these models from the big
| companies (which seems highly unlikely), this might be yet
| another flawed study (of which we have seen several in recent
| months, even from serious parties).
| bgwalter wrote:
| The model did not fit in the margin.
|
| We'll never know how many GPUs and how much other assistance
| (like custom code paths) this model got.
| chvid wrote:
| In a few months (weeks, days - maybe it has already happened)
| models will have much better performance on this test.
|
| Not because of any actual increase in "intelligence" but
| because the test will be included in the models' training
| data - either directly, or indirectly as model developers
| "tune" their models to perform better on this particular
| attention-driving test.
| sorokod wrote:
| From the post: "Evaluation began immediately after the 2025 IMO
| problems were released to prevent contamination."
|
| Does this address your concern?
| os2warpman wrote:
| What they mean is that in a couple of weeks there are going
| to be stories titled "LLMS NOW BETTER THAN HUMANS AT 2025
| INTERNATIONAL MATH OLYMPIAD" (stories published as thinly-
| veiled investment solicitations) but in reality they're still
| shitty-- they've just had the answers fed in to be spit back
| out.
| sorokod wrote:
| Companies will game metrics whenever they have the
| opportunity. What else is new?
| esafak wrote:
| I suppose what's new is that the models aren't as smart
| as their companies claimed.
| chvid wrote:
| Not really.
| yunwal wrote:
| Luckily there's a new set of problems every year
| chvid wrote:
| You can really only do a fair, reproducible test if the models
| are static and not sitting behind an API where you have no
| idea how they are updated or continuously tweaked.
| chvid wrote:
| This particular test is heralded as some sort of
| breakthrough and the companies in this field are raising
| billions of dollars from investors and paying their star
| employees tens of millions.
|
| The economic incentives to tweak, tune, or cheat are
| through the roof.
| WD-42 wrote:
| > Gemini 2.5 Pro achieved the highest score with an average of
| 31% (13 points). While this may seem low, especially considering
| the $400 spent on generating just 24 answers
|
| What? That's some serious cash for mostly wrong answers.
| john-h-k wrote:
| The time investment a human has to make to get 31% on the IMO
| is worth far more than $400
| WD-42 wrote:
| The human still has to put in that time. How would you know
| which 31% is correct?
| wiremine wrote:
| How quickly we shift our expectations. If you told me 5 years ago
| we'd have technology that can do this, I wouldn't believe you.
|
| This isn't to say we shouldn't think critically about the use and
| performance of models, but "Not Even Bronze..." turned me off to
| this critique.
| raincole wrote:
| In 2024 AlphaProof reached Silver level, so people justifiably
| expect a lot now.
|
| (It's specifically trained on formalized math problems, unlike
| most LLMs, so it's not an apples-to-apples comparison.)
| wat10000 wrote:
| LLMs are really good with words and kind of crap at "thinking."
| Humans are wired to see these two things as tightly connected.
| A machine that thinks poorly and talks great is inherently
| confusing. A lot of discussion and disputes around LLMs comes
| down to this.
|
| It wasn't that long ago that the Turing Test was seen as the
| gold standard of whether a machine was actually intelligent.
| LLMs blew past that benchmark a year or two ago and people
| barely noticed. This might be moving the goalposts, but I see
| it as a realization that thought and language are less
| inherently connected than we thought.
|
| So yeah, the fact that they even do this well is pretty
| amazing, but they sound like they should be doing so much
| better.
| thaumasiotes wrote:
| > LLMs are really good with words and kind of crap at
| "thinking." Humans are wired to see these two things as
| tightly connected. A machine that thinks poorly and talks
| great is inherently confusing. A lot of discussion and
| disputes around LLMs comes down to this.
|
| It's not an unfamiliar phenomenon in humans. Look at Malcolm
| Gladwell.
| ipsin wrote:
| I was hoping to see the questions (which I can probably find
| online), but also the answers from models and the judge's scores!
| Am I missing a link? Without that I can't tell whether I should
| be impressed or not.
| raincole wrote:
| https://matharena.ai/
|
| On their website you can see the full answers the LLMs gave
| ("click cells to see...")
| gcanyon wrote:
| 99.99+% of all problems humans face do not require particularly
| original solutions. Determining whether LLMs can solve truly
| original (or at least obscure) problems is interesting, and a
| problem worth solving, but it ignores the vast majority of the
| (at least near-term) impact they will have.
| lottin wrote:
| 15 years ago they were predicting that AI would turn everything
| upside down in 15 years' time. It hasn't.
| HEmanZ wrote:
| People who say this don't understand the breakthrough we had
| in the last couple of years. 15 years ago I was laughing at
| people predicting AI would turn everything upside down soon.
| I'm not laughing anymore. I've been around long enough to see
| some AI hype cycles and this time it is different.
|
| 15 years ago I, working on AI systems at a FAANG, would have
| told you "real" AI probably wasn't coming in my lifetime. 15
| years ago the only engineers I knew who thought AI was coming
| soon were dreamers and Silicon Valley koolaiders. The rest of
| us saw we needed a step-function breakthrough that might not
| even exist. But it did, and we got there, a couple of years
| ago.
|
| Now I'm telling people it's here. We've hit a completely
| different kind of technology, and it's so clear to people
| working in the field. The earthquake has happened and the
| tsunami is coming.
| csa wrote:
| Thank you for sharing your experience. It makes the impact
| of the recent advances palpable.
| wat10000 wrote:
| I really doubt a contest for high schoolers contains any truly
| original problems.
| Barrin92 wrote:
| the value of human beings isn't in their capacity to do routine
| tasks but in their capacity to respond with some common sense
| to the critical issues in the 2% at the tail.
|
| This is why original problems are important: they're a measure
| of how sensible something is in an open-ended environment, and
| here the models are completely useless, not just because they
| fail but because of how they fail. The fact that these LLMs,
| according to the article, "invent non-existent math theorems"
| - producing gibberish instead of knowing what they don't know
| - is an indication of how limited this still is.
| wavemode wrote:
| To be frank, I take precisely the opposite view. Most people
| solve novel problems every day, mostly without thinking much
| about it. Our inability to perceive the immense complexity of
| the things we do every day is merely due to familiarity. In
| other words we're blind to the details because our brain
| handles them automatically, not because they don't exist.
|
| Software engineers understand this better than most -
| describing a task in general terms, and doing it yourself, can
| be incredibly easy, even while writing the code to automate the
| task is difficult or impossible, because of all the devilish
| details we don't often think about.
| magicalhippo wrote:
| One interesting takeaway for me, a non-practitioner, was that
| the models appear to be fairly decent at judging their own
| output.
|
| They used best-of-32 and had the same model judge a
| "tournament" to find the best answer. Seems like something
| that could be bolted on reasonably easily, e.g. in say WebUI.
|
| edit: forgot to add that I'm curious if this translates to
| smaller models as well, or if it requires these huge models.
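|
| Roughly, that's a knockout bracket: sample n solutions, then
| have the same model judge pairs until one remains. A minimal
| sketch of the idea (query_model is a hypothetical stand-in
| for whatever LLM API is in use; this is not MathArena's
| actual code):
|
|   import random
|
|   def query_model(prompt: str) -> str:
|       # Hypothetical stub: swap in a real chat-completion call.
|       raise NotImplementedError
|
|   def best_of_n(problem: str, n: int = 32) -> str:
|       # Sample n candidate solutions independently.
|       candidates = [query_model("Solve:\n" + problem)
|                     for _ in range(n)]
|       random.shuffle(candidates)
|       # Single-elimination tournament judged by the same model.
|       while len(candidates) > 1:
|           winners = []
|           for a, b in zip(candidates[::2], candidates[1::2]):
|               verdict = query_model(
|                   "Problem:\n" + problem
|                   + "\n\nSolution A:\n" + a
|                   + "\n\nSolution B:\n" + b
|                   + "\n\nWhich is more likely fully correct?"
|                   + " Reply with exactly 'A' or 'B'.")
|               winners.append(
|                   a if verdict.strip().upper().startswith("A")
|                   else b)
|           if len(candidates) % 2 == 1:
|               winners.append(candidates[-1])  # bye for odd one
|           candidates = winners
|       return candidates[0]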
| ysofunny wrote:
| this makes me really wonder: what is the underlying practical
| mathematical skill?
|
| intuition????
| samat wrote:
| plus a little skill
| wrsh07 wrote:
| > Each model was run with the recommended hyperparameters and a
| maximum token limit of 64,000. No models needs more than this
| number of tokens
|
| I'm a little confused by this. My assumptions (possibly
| incorrect!): 64k tokens per prompt, and they are claiming the
| model wouldn't need more tokens even for reasoning.
|
| Is that right? Would be helpful to see how many tokens the models
| actually used.
| throwawaymaths wrote:
| they didn't even do a (non-ML) agentic descent? like have a
| quick API that requeries itself, generating new context?
|
| "ok here is my strategy, here are the five steps", then
| requery with a strategy or proof of step 1, 2, 3...
|
| in a DFS
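|
| A rough sketch of what that descent could look like, reusing
| the hypothetical query_model(prompt) -> str helper from above
| (an assumption, not anything the evaluation actually ran):
|
|   def solve_dfs(goal: str, depth: int = 0,
|                 max_depth: int = 3):
|       # Ask for a strategy, try to prove each step, and
|       # recurse depth-first into steps that don't check out.
|       plan = query_model(
|           "Give a short numbered proof strategy for:\n" + goal)
|       steps = [s for s in plan.splitlines() if s.strip()]
|       parts = []
|       for step in steps:
|           attempt = query_model(
|               "Prove this step rigorously:\n" + step
|               + "\n\nOverall goal:\n" + goal)
|           verdict = query_model(
|               "Is this argument sound? Reply yes or no:\n"
|               + attempt)
|           if verdict.strip().lower().startswith("yes"):
|               parts.append(attempt)
|           elif depth < max_depth:
|               sub = solve_dfs(step, depth + 1, max_depth)
|               if sub is None:
|                   return None  # backtrack: strategy fails here
|               parts.append(sub)
|           else:
|               return None
|       return "\n\n".join(parts)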
| akomtu wrote:
| Easy benchmark that's hard to fake: data compression.
| Intelligence is largely about creating compact predictive models
| and so is data compression. The output should be a program
| generating the sequence or the dataset, based on entry id or
| nearby data points. Typical LLM bullshit won't work here because
| the output isn't English prose that can fool a human.
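|
| A toy version of such a scorer (a sketch under the assumption
| that the benchmark asks the model to emit a generate()
| function; mdl_score is a made-up name):
|
|   import zlib
|
|   def mdl_score(program_source: str, dataset: bytes) -> float:
|       # The submitted program must define generate() -> bytes.
|       namespace: dict = {}
|       exec(program_source, namespace)
|       if namespace["generate"]() != dataset:
|           return float("inf")  # wrong output: no credit
|       # Shorter programs that reproduce the data exactly
|       # indicate a more compact predictive model of it.
|       return float(len(zlib.compress(program_source.encode())))
|
|   data = bytes(i % 7 for i in range(1000))
|   prog = ("def generate():\n"
|           "    return bytes(i % 7 for i in range(1000))")
|   print(mdl_score(prog, data))  # low score: pattern found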
| esjeon wrote:
| > For Problem 5, models often identified the correct strategies
| but failed to prove them, which is, ironically, the easier part
| for an IMO participant. This contrast ... suggests that models
| could improve significantly in the near future if these
| relatively minor logical issues are addressed.
|
| Interesting, but I'm not sure this is really due to "minor
| logical issues". It sounds like a failure due to a lack of
| actual understanding (the world-model problem). Perhaps the
| actual answers from the AIs might hold some hints, but I
| can't find them.
|
| (EDIT: oops, found the output on the _main_ page of their
| website. Didn't expect that.)
|
| > Best-of-n is Important ... the models are surprisingly
| effective at identifying the relative quality of their own
| outputs during the best-of-n selection process and are able to
| look past coherence to check for accuracy.
|
| Yes, it's always easier to be a backseat driver.
| Lerc wrote:
| > _Yes, it's always easier to be a backseat driver_
|
| Any model that can identify the correct answer reliably can
| arrive at the correct answer given enough time and
| stochasticity.
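|
| In other words, a reliable verifier becomes a solver by
| rejection sampling. A minimal sketch, again assuming a
| hypothetical query_model(prompt) -> str helper:
|
|   def sample_until_verified(problem: str, budget: int = 100):
|       for _ in range(budget):
|           candidate = query_model("Solve:\n" + problem)
|           verdict = query_model(
|               "Is this solution fully correct?"
|               " Reply yes or no:\n" + candidate)
|           if verdict.strip().lower().startswith("yes"):
|               return candidate
|       return None  # budget exhausted, no verified answer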
| daedrdev wrote:
| Here are the IMO problems if you want to give them a try:
|
| https://www.imo-official.org/year_info.aspx?year=2025 (download
| page)
|
| They are very difficult.
| strangescript wrote:
| "You know that really hard test thing that most humans on the
| planet can't do, or even understand, yeah, LLMs kind of suck at
| it too"
|
| Meanwhile Noam "well aschtually..."
|
| I love how people are still betting against AI, it's hilarious.
| Please write more 2000-esque "The internet is a fad" articles
| boringg wrote:
| It's quite reasonable. We have yet to meet anything more
| intelligent than humans, so why do we think we can create
| something more intelligent than us when we don't fully
| understand the complexities of how we work?
|
| AI still has a long way to go, though it has proven to be a
| useful tool at this point.
| strangescript wrote:
| who said anything about creating something more intelligent
| than us? these articles have the air of "why are we wasting
| our time on this stuff", people like gary marcus link them,
| meanwhile the models get better week over week
| AndrewKemendo wrote:
| Can someone tell me where your average everyday human that's
| walking around and has a regular job and kids and a mortgage
| would land on this leaderboard? That's who we should be comparing
| against.
|
| The fact that the only formal comparisons for AI systems that are
| ever done are explicitly based on the highest performing narrowly
| focused humans, tells me how unprepared society is for what's
| happening.
|
| Appreciate that at the point at which there is an unambiguous
| demonstration of superhuman-level performance across all human
| tasks by a machine, (and make no mistake, that *is the bar that
| this blog post and every other post about AI sets*) it's
| completely over for the human race; unless someone figures out an
| entirely new economic system.
| raincole wrote:
| > average every day human
|
| The average math major can't get Bronze.
| pphysch wrote:
| Machines have always had superhuman capabilities in narrow
| domains. The LLM domain is quite broad, but it's still just an
| LLM, beholden to its training.
|
| The average everyday human does not have the time to read all
| available math texts. LLMs do, but they still can't get bronze.
| What does that say about them?
| zdragnar wrote:
| The average person is bad at literally almost everything.
|
| If I want something done, I'll seek out someone with a skill
| set that matches the problem.
|
| I don't want AI to be as good as an average person. I want AI
| to be better than the person I would go to for help. A person
| can talk with me, understand where I've misunderstood my own
| problem, can point out faulty assumptions, and may even tell me
| that the problem isn't even a problem that needs solving. A
| person can suggest a variety of options and let me decide what
| trade-offs I want to make.
|
| If I don't trust the AI to do that, then I'm not sure why I'd
| use it for anything other than things that don't need to be
| done at all, unless I can justify the chance that maybe it'll
| be done right, and I can afford the time lost getting it done
| right without the AI afterwards.
| SirFatty wrote:
| "The average person is bad at literally almost everything."
|
| Wow... that's quite a generalization. And not my experience
| at all.
| Retric wrote:
| The average person can't play 99% of all musical
| instruments, speak 99% of all languages, do 99% of
| surgeries, recite 99% of all poems from memory etc.
|
| We don't ask the average person to do most things, either
| finding a specialist or providing training beforehand.
| krapp wrote:
| One cannot be bad at the things one doesn't even do. None
| of this demonstrates that humans are bad at "literally
| almost everything."
| Retric wrote:
| > One cannot be bad at the things one doesn't even do.
|
| ??? If you don't know how to do something you're really
| bad at it. I'm not sure what that sentence is even trying
| to convey.
| krapp wrote:
| > Obviously you could train someone to recite The Raven from
| memory, but they can't do it now.
|
| That doesn't make them bad at reciting The Raven from
| memory. Being trained to recite The Raven from memory and
| still being unable to do so would be a proper application
| of the term. There is an obvious difference between the
| two states of being and conflating them is specious.
|
| If you want to take seriously the premise that humans are
| bad at almost everything because most humans haven't been
| trained at doing almost everything humans can do, then
| you must apply the same rubric to LLMs, which are only
| capable of expressions within their specific dataset (and
| thus not the entire corpus of data on which they _haven't_
| been trained) and which even then tend to confabulate far
| more frequently than human beings at even simple tasks.
|
| edit: never mind, I guess you aren't willing to take this
| conversation in good faith.
| mysterydip wrote:
| Didn't this start with "Can someone tell me where your
| average every day human that's walking around and has a
| regular job and kids and a mortgage would land on this
| leaderboard? That's who we should be comparing against."
|
| And the average person would do poorly. Not because they
| couldn't be trained to do it, but because they haven't.
| krapp wrote:
| It's obvious that the average person would do badly at the
| International Math Olympiad. Although I don't know why
| the qualifiers of "regular job and kids and a mortgage"
| are necessary, except as a weird classist signifier. I
| strongly suspect most people on HN, who consider
| themselves set apart from the average, with some also
| having a regular job, kids and a mortgage, would also not
| do well at the International Math Olympiad.
|
| But that isn't the claim I'm objecting to. The claim I'm
| objecting to is "The average person is bad at literally
| almost everything," which is not an equivalent claim to
| "people who aren't trained at math would be bad at math
| at a competitive level," because it implicitly includes
| everything that a person is trained in and is expected to
| be qualified to do.
|
| It was just bad, cynical hyperbole. And it's weird that
| people are defending it so aggressively.
| Retric wrote:
| > it implicitly includes everything that a person is
| trained in
|
| It explicitly includes everything they are trained in;
| what's perhaps misleading is the word "average".
|
| You could phrase the same idea as, "Every individual is
| literally bad at almost everything." But "every individual"
| is a confusing framing here, where "average person" is the
| more intuitive one.
| rahimnathwani wrote:
| It's obvious that 'bad at' in this context means
| 'incapable of doing well'.
|
| Nitpicking language doesn't help move the conversation
| forward. One thing most humans _are_ good at is
| understanding meaning even when the speaker wasn't
| absolutely precise.
| gundmc wrote:
| You and the parent poster seem to be conflating the ideas
| of:
|
| - Does not have the requisite skills and experiences to
| do X successfully
|
| - Inherently does not have the capacity to do X
|
| I think the former is a reasonable standard to apply in
| this context. I'd definitely say I would be bad if I
| tried to play the guitar, but I'm not inherently
| incapable of doing it. It's just not very useful to say
| "I could be good at it if I put 1000 hours of practice
| in."
| zdragnar wrote:
| That's why there's the qualifier of "average person". If
| one learns to play the guitar well, they are no longer
| the average person in the context of guitar playing.
| rahimnathwani wrote:
| More than 50% of people cannot write a 'hello world'
| program in any programming language.
|
| More than 50% of people employed as software engineers
| cannot read an academic paper in a field like education,
| and explain whether the conclusions are sound, based on the
| experiment description and included data.
|
| More than 50% of people cannot interpret an X-ray.
| csa wrote:
| > More than 50% of people employed as software engineers
| cannot read an academic paper in a field like education,
| and explain whether the conclusions are sound, based on
| the experiment description and included data.
|
| I know this was meant as a dig, but I'm actually guessing
| that software engineers score higher on this task than
| non-engineers who hold M.Ed. degrees.
| rahimnathwani wrote:
| Agreed! Probably 3% of software engineers could do it, vs 1%
| for M.Ed holders.
|
| The only reason I chose software engineers is because I
| was trying to show that people who can write 'hello
| world' programs (first example) are not good at _all_
| intellectual tasks.
| AndrewKemendo wrote:
| Which proves my point precisely: unless you're superhuman by
| this definition, you're obsolete.
|
| Nothing new really, but there's nowhere left to go for human
| labor, and even that concept is being jeered at as a fantasy
| despite this attitude.
| baobabKoodaa wrote:
| Average human would score exactly 0 at IMO.
| bgwalter wrote:
| Average humans, no. Mathematicians with enough time and a
| well-indexed database of millions of similar problems,
| probably.
|
| We don't allow chess players to access a Syzygy tablebase in a
| tournament.
| pragmatic wrote:
| That's not how modern societies/economies work.
|
| We have specialists everywhere.
| AndrewKemendo wrote:
| My literal last sentence addresses this
| bgwalter wrote:
| So the gold medal claims in
| https://news.ycombinator.com/item?id=44613840 look exaggerated.
|
| The whole competition is unfair anyway. An "AI" has access to
| millions of similar problems stolen and encoded in the model.
| Humans would at least need access to a similar database; think
| open-database exam, a nuclear version of an open-book exam.
___________________________________________________________________
(page generated 2025-07-19 23:00 UTC)