[HN Gopher] A human metaphor for evaluating AI capability
       ___________________________________________________________________
        
       A human metaphor for evaluating AI capability
        
       Author : bertman
       Score  : 129 points
       Date   : 2025-07-20 08:13 UTC (14 hours ago)
        
 (HTM) web link (mathstodon.xyz)
 (TXT) w3m dump (mathstodon.xyz)
        
       | chronic0262 wrote:
       | > Related to this, I will not be commenting on any self-reported
       | AI competition performance results for which the methodology was
       | not disclosed in advance of the competition.
       | 
       | what a badass
        
         | amelius wrote:
          | Yes, I think it is disingenuous of OpenAI to make ill-supported
          | claims about things that can affect us in important ways,
          | shaping our worldview and our place in the world as an
          | intelligent species. They should be corrected here, and Tao is
          | doing a good job.
        
       | svat wrote:
       | Great set of observations, and indeed it's worth remembering that
       | the specific details of assistance and setup make a difference of
       | several orders of magnitude. And ha, he edited the last post in
       | the thread to add this comment:
       | 
       | > _Related to this, I will not be commenting on any self-reported
       | AI competition performance results for which the methodology was
        | not disclosed in advance of the competition. (3/3)_
       | 
       | (This wasn't there when I first read the thread yesterday 18
       | hours ago; it was edited in 15 hours ago i.e. 3 hours later.)
       | 
       | It's one of the things to admire about Terence Tao: he's always
       | insightful even when he comments about stuff outside mathematics,
       | while always having the mathematician's discipline of _not_
       | drawing confident conclusions when data is missing.
       | 
       | I was reminded of this because of a recent thread where some HN
       | commenter expected him to make predictions about the future
       | (https://news.ycombinator.com/item?id=44356367). Also reminded of
        | Sherlock Holmes (from _A Scandal in Bohemia_):
       | 
       | > _"This is indeed a mystery," I remarked. "What do you imagine
       | that it means?"_
       | 
       | > _"I have no data yet. It is a capital mistake to theorize
       | before one has data. Insensibly one begins to twist facts to suit
       | theories, instead of theories to suit facts."_
       | 
       | Edit: BTW, seeing some other commentary (here and elsewhere)
       | about these posts is very disappointing -- even when Tao
       | explicitly says he's not commenting about any specific claim
       | (like that of OpenAI), many people seem to be eager to interpret
       | his comments as being about that claim: people's tendency for
       | tribalism / taking "sides" is so great that they want to read
       | this as Tao caring about the same things they care about, rather
       | than him using the just-concluded IMO as an illustration for the
       | point he's actually making (that results are sensitive to
       | details). In fact his previous post
       | (https://mathstodon.xyz/@tao/114877789298562646) was about "There
       | was not an official controlled competition set up for AI models
       | for this year's IMO [...] Hopefully by next year we will have a
       | controlled environment to get some scientific comparisons and
       | evaluations" -- he's specifically saying we cannot compare across
       | different AI models so it's hard to say anything specific, yet
       | people think he's saying something specific!
        
       | johnecheck wrote:
       | My thoughts were similar. OpenAI, very cool result! Very exciting
       | claim! Yet meaningless in the form of a Twitter thread with no
       | real details.
        
       | roxolotl wrote:
        | This does a great job illustrating the challenges of arguing
        | over these results. Those in the AGI camp will argue that the
        | alterations are mostly what makes the AI so powerful.
       | 
        | Multiple days' worth of processing, cross-communication, picking
       | only the best result? That's just the power of parallel
       | processing and how they reason so well. Altering to a more
       | standard prompt? Communicating with a more strict natural
       | language helps reduce confusion. Calculator access and the vast
       | knowledge of humanity built in? That's the whole point.
       | 
       | I tend to side with Tao on this one but the point is less who's
       | right and more why there's so much arguing past each other. The
       | basic fundamentals of how to judge these tools aren't agreed
       | upon.
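        | 
        | (For concreteness, here is a minimal sketch of that "run many
        | attempts and keep the best" setup -- a generic illustration only,
        | not OpenAI's undisclosed pipeline; generate_attempt() and score()
        | are hypothetical stand-ins:)
        | 
        |   import random
        | 
        |   def generate_attempt(problem: str, seed: int) -> str:
        |       # Stand-in for one independent model attempt at the problem.
        |       random.seed(seed)
        |       return f"attempt {seed}: {problem} (quality {random.random():.2f})"
        | 
        |   def score(attempt: str) -> float:
        |       # Stand-in grader; a real one might be a verifier or a human marker.
        |       return float(attempt.rsplit("quality ", 1)[1].rstrip(")"))
        | 
        |   def best_of_n(problem: str, n: int) -> str:
        |       # Fan out n attempts (in practice, in parallel) and keep the best.
        |       attempts = [generate_attempt(problem, seed) for seed in range(n)]
        |       return max(attempts, key=score)
        | 
        |   print(best_of_n("IMO problem 1", n=16))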
        
         | johnecheck wrote:
         | Would be nice if we actually knew what was done so we could
         | discuss how to judge it.
         | 
         | That recent announcement might just be fluff or might be some
         | real news, depending. We just don't know.
         | 
         | I can't even read into their silence - this is exactly how much
         | OpenAI would share in the totally grifting scenario _and_ in
         | the massive breakthrough scenario.
        
           | algorithms432 wrote:
           | Well, they deliberately ignored the requests of IMO
           | organizers to not publish AI results for some time (a week?)
           | to not steal the spotlight from the actual participants, so
           | clearly this announcement's purpose is creating hype. Makes
           | me lean more towards the "totally grifting" scenario.
        
             | bgwalter wrote:
             | Amazing. Stealing the spotlight from High School students
             | is really quite something.
             | 
             | I'm glad that Tao has caught on. As an academic it is easy
             | to assume integrity from others but there is no such thing
             | in software big business.
        
               | bluefirebrand wrote:
               | > As an academic it is easy to assume integrity from
               | others
               | 
               | I'm not an academic, but from the outside looking in on
               | academia I don't think academics should be so quick to
               | assume integrity either
               | 
                | There seem to be a lot of perverse incentives in
                | academia to cheat, cut corners, publish at all costs, etc.
        
             | letmevoteplease wrote:
             | The source of this claim is a tweet.[1] The tweet
             | screencaps a mathematician who says they talked to an IMO
             | board member who told them "it was the general sense of the
             | Jury and Coordinators that it's rude and inappropriate for
             | AI developers to make announcements about their IMO
             | performances too close to the IMO." This has now morphed
             | into "OpenAI deliberately ignored the requests of IMO
             | organizers to not publish AI results for some time."
             | 
             | [1] https://x.com/Mihonarium/status/1946880931723194389
        
               | algorithms432 wrote:
               | The very tweet you're referencing: "Still, the IMO
               | organizers directly asked OpenAI not to announce their
               | results immediately after the olympiad."
               | 
                | (Also, here is the source of the screencap:
                | https://leanprover.zulipchat.com/#narrow/channel/219941-Mach...)
        
               | letmevoteplease wrote:
               | The tweet is not an accurate summary of the original
               | post. The person who said they talked to the organizer
               | did not say that. And now we are relying on a tweet from
               | a person who said they talked to a person who said they
               | talked to an organizer. Quite a game of telephone, and
               | yet you're presenting it as some established truth.
        
         | griffzhowl wrote:
         | > Calculator access and the vast knowledge of humanity built
         | in? That's the whole point.
         | 
         | I think Tao's point was that a more appropriate comparison
         | between AI and humans would be to compare it with humans that
         | have calculator/internet access.
         | 
          | I agree with your overall point though: it's not straightforward
          | to specify exactly what would be an appropriate comparison.
        
         | zer00eyz wrote:
         | > Those in the agi camp will argue that the alterations are
         | mostly what makes the ai so powerful.
         | 
          | And here is a group of people who are painfully unaware of
          | history.
         | 
         | Expert systems were amazing. They did what they were supposed
         | to do, and well. And you could probably build better ones today
         | on top of the current tech stack.
         | 
          | Why hasn't anyone done that? Because constantly having to pay
          | experts to come in and assess, update, test, and measure your
          | system was too great a burden for the results returned.
         | 
         | Sound familiar?
         | 
          | LLMs are completely incapable of synthesis. They are incapable
          | of the complex chaining, the type that one has to do when
          | working with systems that aren't well documented. Don't believe
          | me? Ask an LLM to help you with Buildroot on a newly minted
          | embedded system.
         | 
         | Go feed an LLM one of the puzzles from here:
         | https://daydreampuzzles.com/logic-grid-puzzles/ -- If you want
          | to make it more fun, change the names to those of killers and
          | dictators and the acts to ones it's been "told" to
          | dissuade.
         | 
          | Could we re-tool an LLM to solve these sorts of matrix-style
          | problems? Sure. Is that going to generalize to the same sorts
          | of logic and reasoning matrices that a complex state machine
          | requires? Not without a major breakthrough of a very different
          | nature to the current work.
        
           | godelski wrote:
           | > you could probably build better ones today on top of the
           | current tech stack.
           | 
           | In a way, this is being done. If you look around a little
           | you'll see a bunch of jobs that pay like $50+/hr for anyone
           | with a hard science degree to answer questions. This is one
           | of the ways they're collecting data and trying to create new
           | data.
           | 
           | If we're saying expert systems are exclusively decision
           | trees, then yeah, I think it would be a difficult argument to
           | make[0]. But if you're using the general concept of a system
           | that has a strong knowledge base but superficial knowledge,
            | well, current LLMs have very similar problems to expert
           | systems[1].
           | 
           | I'm afraid that people read this as "LLMs suck" or "LLMs are
           | useless" but I don't think that at all. Expert systems are
           | pretty useful, as you mention. You get better use out of your
           | tools when you understand what they can and can't do. What
           | they are better at and worse at, even when they can do
            | things. LLMs are great, but oversold.
            | 
            | > Go feed an LLM one of the puzzles from here
            | 
            | These are also good. But mind you, both are online and have
            | been for a while. All these problems should be assumed to be
            | within the training data.
            | https://www.oebp.org/welcome.php
           | 
            | [0] We'd need more interpretability of these systems, and then
            | you'd have to resolve the question of whether superpositioning
            | is allowed in decision trees. But I don't think LLMs are just
            | fancy decision trees.
           | 
           | [1] https://en.wikipedia.org/wiki/Expert_system#Disadvantages
        
             | bwfan123 wrote:
              | Generally, this class of constraint-satisfaction problems
              | falls under the "zebra puzzle" (or Einstein puzzle) umbrella
              | [1]. They are interesting because they posit a world with
              | some axioms and inference procedures, and ask whether a
              | certain statement follows from them. LLMs as-is (without
              | provers or tool usage) would have a difficult time with
              | these constraint-satisfaction puzzles. 3-SAT is a corner
              | case of these puzzles, and if LLMs could solve them in P
              | time, then we would have found a constructive proof of P=NP, lol!
             | 
             | [1] https://en.wikipedia.org/wiki/Zebra_Puzzle
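              | 
              | A minimal brute-force sketch of a zebra-style puzzle (a
              | made-up three-house example, not one of the linked puzzles)
              | shows the constraint-satisfaction structure: enumerate
              | assignments and keep only those satisfying every clue:
              | 
              |   from itertools import permutations
              | 
              |   people = ("Alice", "Bob", "Carol")
              |   drinks = ("tea", "coffee", "milk")
              | 
              |   for who in permutations(people):        # who[i] lives in house i
              |       for drink in permutations(drinks):  # drink[i] is drunk in house i
              |           if who[0] != "Alice":                  # clue 1: Alice is leftmost
              |               continue
              |           if drink[who.index("Bob")] != "milk":  # clue 2: Bob drinks milk
              |               continue
              |           if abs(drink.index("coffee") - who.index("Alice")) != 1:
              |               continue                           # clue 3: coffee is next to Alice
              |           print(list(zip(who, drink)))
              |           # -> [('Alice', 'tea'), ('Carol', 'coffee'), ('Bob', 'milk')]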
        
             | zer00eyz wrote:
             | > In a way, this is being done. If you look around a little
             | you'll see a bunch of jobs that pay like $50+/hr for anyone
             | with a hard science degree to answer questions. This is one
             | of the ways they're collecting data and trying to create
             | new data.
             | 
              | This is what expert systems did, and why they fell apart.
              | The cost of doing this, ongoing and forever, never justified
              | the results. It likely still wouldn't, even at minimum
              | wage, and maybe more so because LLMs require so much more
              | data.
             | 
             | > All these problems should be assumed to be within the
             | training data.
             | 
              | And yet most models are going to fall flat on their face
              | with these. "In the data" isn't enough for them to make the
              | leap to a solution.
             | 
              | The reality is that "language" is just a representation of
              | knowledge. The idea that we're going to gather enough
              | examples and jump to intelligence is a mighty large
              | assumption. I don't see an underlying commutative property
              | at work in any of the LLMs we have today. The sooner we
              | come to an understanding that there is no (a)I coming, the
              | sooner we can get down to building out LLMs to their full
              | (if limited) potential.
        
       | largbae wrote:
       | I feel like everyone who treats AGI as "the goal" is wasting
       | energy that could be applied towards real problems right now.
       | 
       | AI in general has given humans great leverage in processing
       | information, more than we have ever had before. Do we need AGI to
       | start applying this wonderful leverage toward our problems as a
       | species?
        
       | d4rkn0d3z wrote:
       | As a graduate student I was actually given tests that more
        | closely resembled the second scenario the author described.
       | Difficult problems in GR, a whole weekend to work on them, no
       | limits as to who or what references I consulted.
       | 
        | This sounds great until you realize there are only a handful of
        | people on earth who could offer any help, and the proofs you
        | will write are not available in print anywhere.
       | 
       | I asked one of those questions of Grok 4 and its response was to
       | issue "an error". AFAIK, in many results quoted for AI
        | performance, filling the answer box yields full marks, but I would
        | have received a big fat zero had I done the same.
        
         | godelski wrote:
          | As a physics undergraduate I had similar-style tests for my
          | upper-division classes (the classical mechanics professor
          | especially loved these). We'd have like 3 days to do the test,
          | open book, open internet[0], and the professor extended his
          | office hours, but no help from peers. It really stretched your
          | thinking. It removed the time pressure but really gave you a
          | sense of what it was like to be a real physicist.
         | 
          | Even though in the last decade a lot more of that complex
          | material has appeared online, there's still a lot that hasn't.
         | Unfortunately, I haven't seen any AI system come close to
         | answering any of these types of questions. Some look right at a
         | glance but often contain major errors pretty early on.
         | 
         | I wouldn't be surprised if an LLM can ace the Physics GRE. The
         | internet is filled with the test questions and there are so few
         | variations. But I'll be impressed when they can answer one of
         | these types of tests. They require that you actually do world
         | modeling (and not necessarily of the literal world, just the
         | world that the physics problem lives in[1]). Most humans can't
          | get these right without drawing diagrams. You've got to pull a
          | lot of different pieces of moving information together.
         | 
         | [0] you were expected to report if you stumbled on the solution
         | somewhere. No one ever found one though
         | 
         | [1] an important distinction for those working on world models.
         | What world are you modeling? Which physics are you modeling?
        
           | bwfan123 wrote:
            | Would you mind sharing a sketch of one problem from the test
            | you mention? I am interested in what it looks like.
        
             | godelski wrote:
             | It's been a decade, so I don't have any of the actual tests
             | anymore. But the class used Marion and Thornton's Classical
             | Mechanics[0] and occasionally pulled from Goldstein's
              | book[1]. It was an undergrad class, so we only pulled from
              | the latter in the Classical II class.
             | 
             | For these very tough physics (and math) problems usually
             | the most complex part is just getting started. Sure, there
             | would always be some complex weird calculation that needs
              | to be done, but often by the time you get there you have
              | a general knowledge of what actually needs to be solved, and
              | that gives you a lot of clues. For classical mechanics we were
             | usually concerned with deriving the Hamiltonian of the
             | system[2]. By no means is the computation easy, but I found
             | (and this seemed to be common) that the hardest part was
              | getting everything set up and ensuring you have an accurate
              | description from which to derive. Small differences can be
              | killers, and that was often the point. There are a lot of
              | tools that give you a kind of "sniff test" as to whether you've
             | accounted for everything or not, but many of these are not
             | available until you've already gotten through a good chunk
             | of computation (or all the way!). Which, tbh, is really the
             | hard part of doing science. It is the attention to detail,
             | the nuances. Which should make sense, as if this didn't
             | matter we'd have solved everything long ago, right?
             | 
              | I mean, in the experimental section of my optics class we
              | were also tested on things like just setting up a laser so
              | that it would properly lase. I was one of only two people in
              | my cohort who could reliably do it. You had to be very
              | meticulous and constantly think about how the one part you're
             | working with is interacting with the system as a whole. Not
             | to mention the poor tolerances of our lab equipment lol.
             | 
             | Really, a lot of it comes down to world modeling. I'm an AI
             | researcher now and I think a lot of people really are
             | oversimplifying what this term actually means. Like many of
             | those physics problems, it looks simple at face value but
             | it isn't until you get into the depth that you see the
             | beauty and complexity of it all.[3]
             | 
              | [0] https://www.amazon.com/Classical-Dynamics-Particles-Systems-...
             | 
              | [1] https://www.amazon.com/Classical-Mechanics-3rd-Herbert-Golds...
             | 
             | [2] Once you're out of basic physics classes you usually
             | don't care about numbers. It is all about symbolic
             | manipulation. The point of physics is to generate causal
             | explanations, ones that are counterfactual. So you are
             | mainly interested in the description of the system because
              | from there you can plug in any numbers you wish. The joke is
              | that you do this and then hand it off to the engineer or a
              | computer.
             | 
             | [3] A pet peeve of mine is that people will say "I just
             | care that it works." I hate this because it is a shared
             | goal no matter your belief about approach (who doesn't want
             | it to work?! What an absurd dichotomy). The people that
             | think the AI system needs to derive (learn) realistic
             | enough laws of physics are driven because they are
             | explicitly concerned with things working. It's not about
             | "theory" as it is that this is a requirement for having a
             | generalizable solution. They understand how these subtle
             | differences _quickly_ cascade into big differences. I mean
              | your basic calculus-level physics is good enough for a
              | spherical chicken in a vacuum, but it gets much more complex
              | when you want to operate in the real world. Unfortunately,
              | there are things that can't be determined purely through
              | observation (even in a purely mechanical universe).
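              | 
              | To make footnote [2] concrete, here is a tiny sympy sketch of
              | the kind of symbolic work described there -- deriving the
              | Hamiltonian of a simple pendulum via the Legendre transform
              | (a toy example, not one of the actual exam problems):
              | 
              |   import sympy as sp
              | 
              |   # Generalized coordinate theta, its velocity, and canonical momentum p
              |   m, g, l, theta, theta_dot, p = sp.symbols("m g l theta theta_dot p", real=True)
              | 
              |   # Lagrangian L = T - V for a pendulum of mass m on a rod of length l
              |   T = sp.Rational(1, 2) * m * l**2 * theta_dot**2
              |   V = -m * g * l * sp.cos(theta)
              |   L = T - V
              | 
              |   # Legendre transform: p = dL/d(theta_dot), H = p*theta_dot - L
              |   p_expr = sp.diff(L, theta_dot)
              |   theta_dot_sol = sp.solve(sp.Eq(p, p_expr), theta_dot)[0]
              |   H = sp.simplify((p * theta_dot_sol - L).subs(theta_dot, theta_dot_sol))
              | 
              |   print(H)  # H = p**2/(2*m*l**2) - m*g*l*cos(theta)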
        
       | ants_everywhere wrote:
       | I like this approach in general for understanding AI tools.
       | 
       | There are lots of things computers can do that humans can't, like
       | spawn N threads to complete a calculation. You _can_ fill a room
       | with N human calculators and combine their results.
       | 
       | If your goal is to just understand the raw performance of the AI
       | as a tool, then this distinction doesn't really matter. But if
        | you want to compare the performance of the AI on a task against
        | the performance of an individual human, you have to control the
        | relevant variables.
        
       ___________________________________________________________________
       (page generated 2025-07-20 23:01 UTC)