[HN Gopher] A human metaphor for evaluating AI capability
___________________________________________________________________
A human metaphor for evaluating AI capability
Author : bertman
Score : 129 points
Date : 2025-07-20 08:13 UTC (14 hours ago)
(HTM) web link (mathstodon.xyz)
(TXT) w3m dump (mathstodon.xyz)
| chronic0262 wrote:
| > Related to this, I will not be commenting on any self-reported
| AI competition performance results for which the methodology was
| not disclosed in advance of the competition.
|
| what a badass
| amelius wrote:
| Yes, I think it is disingenuous of OpenAI to make ill-supported
| claims about things that can affect us in important ways,
| shaping our worldview and our place in the world as an
| intelligent species. They should be corrected here, and TT is
| doing a good job.
| svat wrote:
| Great set of observations, and indeed it's worth remembering that
| the specific details of assistance and setup make a difference of
| several orders of magnitude. And ha, he edited the last post in
| the thread to add this comment:
|
| > _Related to this, I will not be commenting on any self-reported
| AI competition performance results for which the methodology was
| not disclosed in advance of the competition. (3 /3)_
|
| (This wasn't there when I first read the thread yesterday 18
| hours ago; it was edited in 15 hours ago i.e. 3 hours later.)
|
| It's one of the things to admire about Terence Tao: he's always
| insightful even when he comments about stuff outside mathematics,
| while always having the mathematician's discipline of _not_
| drawing confident conclusions when data is missing.
|
| I was reminded of this because of a recent thread where some HN
| commenter expected him to make predictions about the future
| (https://news.ycombinator.com/item?id=44356367). Also reminded of
| Sherlock Holmes (from _A Scandal in Bohemia_):
|
| > _"This is indeed a mystery," I remarked. "What do you imagine
| that it means?"_
|
| > _"I have no data yet. It is a capital mistake to theorize
| before one has data. Insensibly one begins to twist facts to suit
| theories, instead of theories to suit facts."_
|
| Edit: BTW, seeing some other commentary (here and elsewhere)
| about these posts is very disappointing -- even when Tao
| explicitly says he's not commenting about any specific claim
| (like that of OpenAI), many people seem to be eager to interpret
| his comments as being about that claim: people's tendency for
| tribalism / taking "sides" is so great that they want to read
| this as Tao caring about the same things they care about, rather
| than him using the just-concluded IMO as an illustration for the
| point he's actually making (that results are sensitive to
| details). In fact his previous post
| (https://mathstodon.xyz/@tao/114877789298562646) was about "There
| was not an official controlled competition set up for AI models
| for this year's IMO [...] Hopefully by next year we will have a
| controlled environment to get some scientific comparisons and
| evaluations" -- he's specifically saying we cannot compare across
| different AI models so it's hard to say anything specific, yet
| people think he's saying something specific!
| johnecheck wrote:
| My thoughts were similar. OpenAI, very cool result! Very exciting
| claim! Yet meaningless in the form of a Twitter thread with no
| real details.
| roxolotl wrote:
| This does a great job illustrating the challenges with arguing
| over these results. Those in the AGI camp will argue that the
| alterations are mostly what makes the AI so powerful.
|
| Multiple days' worth of processing, cross-communication, picking
| only the best result? That's just the power of parallel
| processing and how they reason so well. Altering to a more
| standard prompt? Communicating in a stricter natural language
| helps reduce confusion. Calculator access and the vast
| knowledge of humanity built in? That's the whole point.
|
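| In sketch form, "picking only the best result" is just
| best-of-N selection. A minimal Python sketch (generate() and
| score() are hypothetical placeholders for a model call and a
| grader, not anyone's actual pipeline):
| 
|     import random
| 
|     def generate(prompt: str) -> str:
|         # Placeholder for a real model call.
|         return f"attempt {random.randint(0, 999)}: {prompt}"
| 
|     def score(answer: str) -> float:
|         # Placeholder for a grader or verifier.
|         return random.random()
| 
|     def best_of_n(prompt: str, n: int = 32) -> str:
|         # Run n independent attempts, keep the top scorer.
|         candidates = [generate(prompt) for _ in range(n)]
|         return max(candidates, key=score)
| 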
| I tend to side with Tao on this one but the point is less who's
| right and more why there's so much arguing past each other. The
| basic fundamentals of how to judge these tools aren't agreed
| upon.
| johnecheck wrote:
| Would be nice if we actually knew what was done so we could
| discuss how to judge it.
|
| That recent announcement might just be fluff or might be some
| real news, depending. We just don't know.
|
| I can't even read into their silence - this is exactly how much
| OpenAI would share in the totally grifting scenario _and_ in
| the massive breakthrough scenario.
| algorithms432 wrote:
| Well, they deliberately ignored the requests of IMO
| organizers to not publish AI results for some time (a week?)
| to not steal the spotlight from the actual participants, so
| clearly this announcement's purpose is creating hype. Makes
| me lean more towards the "totally grifting" scenario.
| bgwalter wrote:
| Amazing. Stealing the spotlight from High School students
| is really quite something.
|
| I'm glad that Tao has caught on. As an academic it is easy
| to assume integrity from others but there is no such thing
| in big-business software.
| bluefirebrand wrote:
| > As an academic it is easy to assume integrity from
| others
|
| I'm not an academic, but from the outside looking in on
| academia I don't think academics should be so quick to
| assume integrity either
|
| There seem to be a lot of perverse incentives in
| academia to cheat, cut corners, publish at all costs, etc.
| letmevoteplease wrote:
| The source of this claim is a tweet.[1] The tweet
| screencaps a mathematician who says they talked to an IMO
| board member who told them "it was the general sense of the
| Jury and Coordinators that it's rude and inappropriate for
| AI developers to make announcements about their IMO
| performances too close to the IMO." This has now morphed
| into "OpenAI deliberately ignored the requests of IMO
| organizers to not publish AI results for some time."
|
| [1] https://x.com/Mihonarium/status/1946880931723194389
| algorithms432 wrote:
| The very tweet you're referencing: "Still, the IMO
| organizers directly asked OpenAI not to announce their
| results immediately after the olympiad."
|
| (Also, here is the source of the screencap:
| https://leanprover.zulipchat.com/#narrow/channel/219941-Mach...)
| letmevoteplease wrote:
| The tweet is not an accurate summary of the original
| post. The person who said they talked to the organizer
| did not say that. And now we are relying on a tweet from
| a person who said they talked to a person who said they
| talked to an organizer. Quite a game of telephone, and
| yet you're presenting it as some established truth.
| griffzhowl wrote:
| > Calculator access and the vast knowledge of humanity built
| in? That's the whole point.
|
| I think Tao's point was that a more appropriate comparison
| between AI and humans would be to compare it with humans that
| have calculator/internet access.
|
| I agree with your overall point though: it's not straightforward
| to specify exactly what would be an appropriate comparison.
| zer00eyz wrote:
| > Those in the agi camp will argue that the alterations are
| mostly what makes the ai so powerful.
|
| And here is a group of people who are painfully unaware of
| history.
|
| Expert systems were amazing. They did what they were supposed
| to do, and well. And you could probably build better ones today
| on top of the current tech stack.
|
| Why hasn't anyone done that? Because constantly having to pay
| experts to come in and assess, update, test, and measure your
| system was too great a burden for the results returned.
|
| Sound familiar?
|
| LLMs are completely incapable of synthesis. They are incapable
| of the complex chaining, the type that one has to do when
| working with systems that aren't well documented. Don't believe
| me? Ask an LLM to help you with Buildroot on a newly minted
| embedded system.
|
| Go feed an LLM one of the puzzles from here:
| https://daydreampuzzles.com/logic-grid-puzzles/ -- If you want
| to make it more fun, change the names to those of killers and
| dictators and the acts to ones it's been "told" to dissuade.
| 
| Could we re-tool an LLM to solve these sorts of matrix-style
| problems? Sure. Is that going to generalize to the same sorts
| of logic and reasoning matrices that a complex state machine
| requires? Not without a major breakthrough of a nature very
| different from the current work.
| godelski wrote:
| > you could probably build better ones today on top of the
| current tech stack.
|
| In a way, this is being done. If you look around a little
| you'll see a bunch of jobs that pay like $50+/hr for anyone
| with a hard science degree to answer questions. This is one
| of the ways they're collecting data and trying to create new
| data.
|
| If we're saying expert systems are exclusively decision
| trees, then yeah, I think it would be a difficult argument to
| make[0]. But if you're using the general concept of a system
| that has a strong knowledge base but superficial knowledge,
| well current LLMs have very similar problems to expert
| systems[1].
|
| I'm afraid that people read this as "LLMs suck" or "LLMs are
| useless" but I don't think that at all. Expert systems are
| pretty useful, as you mention. You get better use out of your
| tools when you understand what they can and can't do. What
| they are better at and worse at, even when they can do
| things. LLMs are great, but oversold.
| 
| > Go feed an LLM one of the puzzles from here
|
| These are also good. But mind you, both are online and have
| been for a while. All these problems should be assumed to be
| within the training data.
| https://www.oebp.org/welcome.php
|
| [0] We'd need more interpretability of these systems and then
| you'd have to resolve the question of if superpositioning is
| allowed in decision trees. But I don't think LLMs are just
| fancy decision trees
|
| [1] https://en.wikipedia.org/wiki/Expert_system#Disadvantages
| bwfan123 wrote:
| Generally, this class of constraint-satisfaction problems
| falls under the "zebra puzzle" (or Einstein puzzle) umbrella
| [1]. They are interesting because they posit a world with some
| axioms and inference procedures, and ask whether a certain
| statement follows from them. LLMs as-is (without provers or
| tool usage) would have a difficult time with these
| constraint-satisfaction puzzles. 3-SAT is a corner case of
| these puzzles, and if LLMs could solve them in P time, then we
| have found a constructive proof of P=NP lol!
|
| [1] https://en.wikipedia.org/wiki/Zebra_Puzzle
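| 
| To make the structure concrete, here is a minimal brute-force
| sketch in Python of a toy zebra-style instance (the clues,
| names, and drinks are invented for illustration; real puzzles
| just add more attributes and constraints):
| 
|     from itertools import permutations
| 
|     people = ["Alice", "Bob", "Carol"]
|     drinks = ["tea", "coffee", "milk"]
| 
|     solutions = []
|     # owner[i] / drink[i] belong to house i.
|     for owner in permutations(people):
|         for drink in permutations(drinks):
|             # Clue 1: Alice lives in the first house.
|             if owner[0] != "Alice":
|                 continue
|             # Clue 2: Bob drinks tea.
|             if drink[owner.index("Bob")] != "tea":
|                 continue
|             # Clue 3: Milk is drunk in the middle house.
|             if drink[1] != "milk":
|                 continue
|             solutions.append(list(zip(owner, drink)))
| 
|     # Exactly one assignment survives the clues.
|     print(solutions)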
| zer00eyz wrote:
| > In a way, this is being done. If you look around a little
| you'll see a bunch of jobs that pay like $50+/hr for anyone
| with a hard science degree to answer questions. This is one
| of the ways they're collecting data and trying to create
| new data.
|
| This is what expert systems did, and why they fell apart.
| The cost of doing this, ongoing, forever, never justified
| the results. It likely still would not, even at minimum
| wage, all the more so because LLMs require so much more
| data.
|
| > All these problems should be assumed to be within the
| training data.
|
| And yet most models are going to fall flat on their face
| with these. "In the data" isn't enough for them to make the
| leaps to a solution.
|
| The reality is that "language" is just a representation of
| knowledge. The idea that we're going to gather enough
| examples and jump to intelligence is a mighty large
| assumption. I don't see an underlying commutative property
| at work in any of the LLMs we have today. The sooner we
| come to understand that there is no (A)I coming, the sooner
| we can get down to building out LLMs to their full (if
| limited) potential.
| largbae wrote:
| I feel like everyone who treats AGI as "the goal" is wasting
| energy that could be applied towards real problems right now.
|
| AI in general has given humans great leverage in processing
| information, more than we have ever had before. Do we need AGI to
| start applying this wonderful leverage toward our problems as a
| species?
| d4rkn0d3z wrote:
| As a graduate student I was actually given tests that more
| closely resembled the second scenario the author described:
| difficult problems in GR, a whole weekend to work on them, no
| limits on who or what references I consulted.
| 
| This sounds great until you realize there are only a handful of
| people on earth who could offer any help, and the proofs you
| will write are not available in print anywhere.
| 
| I asked one of those questions of Grok 4 and its response was to
| issue "an error". AFAIK, in many results quoted for AI
| performance, filling the answer box yields full marks, but I
| would have received a big fat zero had I done the same.
| godelski wrote:
| As a physics undergraduate I had similar-style tests in my
| upper-division classes (the classical mechanics professor loved
| these). We'd have like 3 days to do the test, open book, open
| internet[0], and the professor extended his office hours, but
| no help from peers. It really stretched your thinking. It
| removed the time pressure but really gave the sense of what it
| was like to be a real physicist.
|
| Even though in the last decade a lot more of that complex
| material has appeared online, there's still a lot that hasn't.
| Unfortunately, I haven't seen any AI system come close to
| answering any of these types of questions. Some look right at a
| glance but often contain major errors pretty early on.
|
| I wouldn't be surprised if an LLM can ace the Physics GRE. The
| internet is filled with the test questions and there are so few
| variations. But I'll be impressed when they can answer one of
| these types of tests. They require that you actually do world
| modeling (and not necessarily of the literal world, just the
| world that the physics problem lives in[1]). Most humans can't
| get these right without drawing diagrams. You've got to pull a
| lot of different moving information together.
|
| [0] You were expected to report if you stumbled on the solution
| somewhere. No one ever found one, though.
|
| [1] an important distinction for those working on world models.
| What world are you modeling? Which physics are you modeling?
| bwfan123 wrote:
| Would you mind sharing a sketch of one problem from the test
| you mention? I am interested in how it looks.
| godelski wrote:
| It's been a decade, so I don't have any of the actual tests
| anymore. But the class used Marion and Thornton's Classical
| Mechanics[0] and occasionally pulled from Goldstein's
| book[1]. It was an undergrad class, so we only pulled from
| the latter in the Classical II class.
|
| For these very tough physics (and math) problems, usually
| the most complex part is just getting started. Sure, there
| would always be some complex, weird calculation that needs
| to be done, but often by the time you get there you have
| a general knowledge of what actually needs to be solved and
| that gives you a lot of clues. For the classical mechanics
| tests we were usually concerned with deriving the
| Hamiltonian of the system[2]. By no means is the computation
| easy, but I found (and this seemed to be common) that the
| hardest part was getting everything set up and ensuring you
| have an accurate description from which to derive. Small
| differences can be killers and that was often the point.
| There are a lot of tools that give you a kind of "sniff
| test" as to whether you've accounted for everything or not,
| but many of these are not available until you've already
| gotten through a good chunk of the computation (or all the
| way!). Which, tbh, is really the hard part of doing science:
| the attention to detail, the nuances. Which should make
| sense, as if this didn't matter we'd have solved everything
| long ago, right?
|
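| As a concrete (if far simpler) illustration of that
| setup-then-derive work, here is a sketch using sympy to
| Legendre-transform the Lagrangian of a plane pendulum into
| its Hamiltonian (the pendulum is my stand-in example, not
| one of the actual test problems):
| 
|     import sympy as sp
| 
|     m, g, l = sp.symbols("m g l", positive=True)
|     theta, thetadot, p = sp.symbols("theta thetadot p")
| 
|     # Lagrangian: kinetic minus potential energy.
|     L = (sp.Rational(1, 2) * m * l**2 * thetadot**2
|          + m * g * l * sp.cos(theta))
| 
|     # Canonical momentum conjugate to theta.
|     p_expr = sp.diff(L, thetadot)
| 
|     # Invert to express thetadot in terms of p.
|     thetadot_p = sp.solve(sp.Eq(p, p_expr), thetadot)[0]
| 
|     # Legendre transform: H = p*thetadot - L.
|     H = sp.simplify((p * thetadot - L).subs(thetadot, thetadot_p))
|     print(H)  # p**2/(2*l**2*m) - g*l*m*cos(theta)
| 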
| I mean, in the experiment section of my optics class we were
| also tested on things like just setting up a laser so that it
| would properly lase. I was one of two people in my cohort who
| could reliably do it. You had to be very meticulous and
| constantly thinking about how the one part you're working
| with is interacting with the system as a whole. Not to
| mention the poor tolerances of our lab equipment lol.
|
| Really, a lot of it comes down to world modeling. I'm an AI
| researcher now and I think a lot of people really are
| oversimplifying what this term actually means. Like many of
| those physics problems, it looks simple at face value but
| it isn't until you get into the depths that you see the
| beauty and complexity of it all.[3]
|
| [0] https://www.amazon.com/Classical-Dynamics-Particles-Systems-...
|
| [1] https://www.amazon.com/Classical-Mechanics-3rd-Herbert-Golds...
|
| [2] Once you're out of basic physics classes you usually
| don't care about numbers. It is all about symbolic
| manipulation. The point of physics is to generate causal
| explanations, ones that are counterfactual. So you are
| mainly interested in the description of the system because
| from there you can plug in any numbers you wish. The joke is
| that you do this and then hand it off to the engineer or the
| computer.
|
| [3] A pet peeve of mine is that people will say "I just
| care that it works." I hate this because it is a shared
| goal no matter your belief about approach (who doesn't want
| it to work?! What an absurd dichotomy). The people that
| think the AI system needs to derive (learn) realistic
| enough laws of physics are driven because they are
| explicitly concerned with things working. It's not about
| "theory" as it is that this is a requirement for having a
| generalizable solution. They understand how these subtle
| differences _quickly_ cascade into big differences. I mean
| your basic calculus-level physics is good enough for a
| spherical chicken in a vacuum, but it gets much more complex
| when you want to operate in the real world. Unfortunately,
| there are things that can't be determined purely through
| observation (even in a purely mechanical universe).
| ants_everywhere wrote:
| I like this approach in general for understanding AI tools.
|
| There are lots of things computers can do that humans can't, like
| spawn N threads to complete a calculation. You _can_ fill a room
| with N human calculators and combine their results.
|
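| A minimal Python sketch of that fan-out (the partial-sum task
| is an arbitrary stand-in, and with CPython's GIL this shows
| the structure rather than a real speedup):
| 
|     from concurrent.futures import ThreadPoolExecutor
| 
|     N = 8                # number of worker threads
|     LIMIT = 10_000_000   # size of the calculation
| 
|     def partial_sum(k: int) -> int:
|         # Worker k sums its slice of the range.
|         return sum(range(k * LIMIT // N, (k + 1) * LIMIT // N))
| 
|     with ThreadPoolExecutor(max_workers=N) as pool:
|         # Combine partial results, like collecting answers
|         # from a room of human calculators.
|         total = sum(pool.map(partial_sum, range(N)))
| 
|     print(total == sum(range(LIMIT)))  # True
| 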
| If your goal is to just understand the raw performance of the AI
| as a tool, then this distinction doesn't really matter. But if
| you want to compare the performance of the AI on a task against
| the performance of an individual human you have to control the
| relevant variables.
___________________________________________________________________
(page generated 2025-07-20 23:01 UTC)