[HN Gopher] Study identifies weaknesses in how AI systems are ev...
       ___________________________________________________________________
        
       Study identifies weaknesses in how AI systems are evaluated
        
       Paper: https://openreview.net/pdf?id=mdA5lVvNcU  Related:
       https://www.theregister.com/2025/11/07/measuring_ai_models_h...
        
       Author : pseudolus
       Score  : 264 points
       Date   : 2025-11-08 14:18 UTC (8 hours ago)
        
 (HTM) web link (www.oii.ox.ac.uk)
 (TXT) w3m dump (www.oii.ox.ac.uk)
        
       | Marshferm wrote:
       | Don't get high on your own supply.
        
       | calpaterson wrote:
       | Definitely one of the weaker areas in the current LLM boom.
       | Comparing models, or even different versions of the same model,
       | is a pseudo-scientific mess.
       | 
       | I'm still using https://lmarena.ai/leaderboard. Perhaps there is
       | something better and someone will pipe up to tell me about it.
       | But we use LLMs at work and have unexplainable variations between
       | them.
       | 
       | And when we get a prompt working reliably on one model, we often
       | have trouble porting it to another LLM - even straight "version
       | upgrades" such as from GPT-4 to -5. Your prompt and your model
       | become highly coupled quite easily.
       | 
       | I dunno what to do about it and am tending to just pick Gemini as
       | a result.
        
         | ACCount37 wrote:
         | Ratings on LMArena are too easily gamed.
         | 
         | Even professional human evaluators are quite vulnerable to
         | sycophancy and overconfident-and-wrong answers. And LMArena
         | evaluators aren't professionals.
         | 
          | A lot of the sycophancy mess that seeps from this generation of
          | LLMs stems from reckless tuning based on human feedback. Tuning
          | for good LMArena performance has similar effects - and not at
          | all by coincidence.
        
           | energy123 wrote:
            | It's biased toward small-context performance, which is why I
            | don't pay much attention to it as a developer aside from a
            | quick glance. I need performance at 40-100k tokens, which
            | models like DeepSeek can't deliver but Gemini 2.5 Pro and
            | ChatGPT 5.0 Thinking can.
        
             | ACCount37 wrote:
             | And even "long term performance" splits itself into
             | "performance on multi-turn instruction following" and
             | "performance on agentic tasks" down the line. And
             | "performance on agentic tasks" is a hydra in itself.
             | 
             | Capturing LLM performance with a single metric is a
             | hopeless task. But even a single flawed metric beats no
             | metrics at all.
        
         | HPsquared wrote:
         | Psychometric testing of humans has a lot of difficulties, too.
         | It's hard to measure some things.
        
         | diamond559 wrote:
          | I'd rather quit than be forced to beta test idiocracy. What's
         | your company so we can all avoid it?
        
         | botro wrote:
          | This is something I've struggled with for my site. I made
          | https://aimodelreview.com/ to compare the outputs of LLMs over
          | a variety of prompts and categories, allowing a side-by-side
          | comparison between them. I ran each prompt 4 times for each
          | model, with different temperature values available as toggles.
         | 
         | My thinking was to just make the responses available to users
         | and let them see how models perform. But from some feedback,
         | turns out users don't want to have to evaluate the answers and
         | would rather see a leaderboard and rankings.
         | 
         | The scalable solution to that would be LLM as judge that some
         | benchmarks already use, but that just feels wrong to me.
         | 
         | LM Arena tries to solve this with the crowd sourced solution,
         | but I think the right method would have to be domain expert
         | human reviewers, so like Wirecutter VS IMDb, but that is
         | expensive to pull off.
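          | 
          | The core loop for something like this is tiny, by the way. A
          | rough sketch (OpenAI Python SDK; the model name and prompt are
          | placeholders, not what the site actually uses):
          | 
          |   # Sketch: run each prompt a few times per temperature so the
          |   # outputs can be compared side by side.
          |   from openai import OpenAI
          | 
          |   client = OpenAI()  # reads OPENAI_API_KEY from the environment
          |   prompts = ["Explain CRDTs in two sentences."]
          |   temperatures = [0.0, 0.4, 0.8, 1.2]
          | 
          |   results = []
          |   for prompt in prompts:
          |       for temp in temperatures:
          |           for run in range(4):  # 4 runs per prompt/temperature
          |               resp = client.chat.completions.create(
          |                   model="gpt-4o-mini",  # placeholder model
          |                   messages=[{"role": "user", "content": prompt}],
          |                   temperature=temp,
          |               )
          |               results.append({
          |                   "prompt": prompt,
          |                   "temperature": temp,
          |                   "run": run,
          |                   "output": resp.choices[0].message.content,
          |               })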
        
       | bubblelicious wrote:
       | I work on LLM benchmarks and human evals for a living in a
       | research lab (as opposed to product). I can say: it's pretty much
       | the Wild West and a total disaster. No one really has a good
       | solution, and researchers are also in a huge rush and don't want
        | to end up making their whole job benchmarking. Even if you could,
        | and even if you had the right background, you could do benchmarks
        | full time and they would still be a mess.
       | 
        | Product testing (with traditional A/B tests) is kind of the best
        | bet, since you can measure what you care about _directly_ and at
        | scale.
       | 
        | I would say there is of course "benchmarketing", but generally
        | people do sincerely want to make good benchmarks; it's just hard
        | or impossible. For many of these problems we're hitting
        | capabilities where we don't even have a decent paradigm to use.
        
         | ACCount37 wrote:
         | A/B testing is radioactive too. It's indirectly optimizing for
         | user feedback - less stupid than directly optimizing for user
         | feedback, but still quite dangerous.
         | 
         | Human raters are exploitable, and you never know whether the B
         | has a genuine performance advantage over A, or just found a
          | meat exploit by accident.
         | 
         | It's what fucked OpenAI over with 4o, and fucked over many
         | other labs in more subtle ways.
        
           | bubblelicious wrote:
           | Are you talking about just preferences or A/B tests on like
           | retention and engagement? The latter I think is pretty
           | reliable and powerful though I have never personally done
            | them. Preferences are just as big a mess: WHO the annotators
            | are matters, and if you are using preferences as a proxy for,
            | say, correctness, you're not really measuring correctness;
            | you're measuring, e.g., persuasion. There are a lot of
            | construct validity challenges (which are themselves hard to
            | even measure in domain).
        
             | ACCount37 wrote:
             | Yes. All of them are poisoned metrics, just in different
             | ways.
             | 
             | GPT-4o's endless sycophancy was great for retention,
             | GPT-5's style of ending every response in a question is
             | great for engagement.
             | 
             | Are those desirable traits though? Doubt it. They look like
             | simple tricks and reek of reward hacking - and A/B testing
             | rewards them indeed. Direct optimization is even worse.
             | Combining the two is ruinous.
             | 
             | Mind, I'm not saying that those metrics are useless.
             | Radioactive materials aren't useless. You just got to keep
             | their unpleasant properties in mind at all times - or
             | suffer the consequences.
        
         | bjackman wrote:
         | For what it's worth, I work on platforms infra at a hyperscaler
         | and benchmarks are a complete fucking joke in my field too lol.
         | 
         | Ultimately we are measuring extremely measurable things that
         | have an objective ground truth. And yet:
         | 
         | - we completely fail at statistics (the MAJORITY of analysis is
         | literally just "here's the delta in the mean of these two
         | samples". If I ever do see people gesturing at actual proper
         | analysis, if prompted they'll always admit "yeah, well, we do
         | come up with a p-value or a confidence interval, but we're
         | pretty sure the way we calculate it is bullshit")
         | 
         | - the benchmarks are almost never predictive of the performance
         | of real world workloads anyway
         | 
         | - we can obviously always just experiment in prod but then the
         | noise levels are so high that you can entirely miss million-
         | dollar losses. And by the time you get prod data you've already
         | invested at best several engineer-weeks of effort.
         | 
         | AND this is a field where the economic incentives for accurate
         | predictions are enormous.
         | 
         | In AI, you are measuring weird and fuzzy stuff, and you kinda
         | have an incentive to just measure some noise that looks good
         | for your stock price anyway. AND then there's contamination.
         | 
         | Looking at it this way, it would be very surprising if the
         | world of LLM benchmarks was anything but a complete and utter
         | shitshow!
        
           | bofadeez wrote:
              | Even a p-value is insufficient. Maybe we can use some of
              | this stuff: https://web.stanford.edu/~swager/causal_inf_book.pdf
        
             | bjackman wrote:
             | I have actually been thinking of hiring some training
             | contractors to come in and teach people the basics of
             | applied statistical inference. I think with a bit of
             | internal selling, engineers would generally be interested
             | enough to show up and pay attention. And I don't think we
             | need very deep expertise, just a moderate bump in the
             | ambient level of statistical awareness would probably go a
             | long way.
             | 
             | It's not like there's a shortage of skills in this area, it
             | seems like our one specific industry just has a weird
             | blindspot.
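              | 
              | Even just bootstrapping the difference in means would be a
              | step up from eyeballing deltas. A minimal sketch (numpy;
              | the two samples are made-up stand-ins for benchmark runs):
              | 
              |   # Sketch: bootstrap 95% CI for the difference in mean
              |   # runtime between two benchmark samples.
              |   import numpy as np
              | 
              |   rng = np.random.default_rng(0)
              |   a = np.array([10.2, 10.5, 9.9, 10.4, 10.1, 10.6, 10.3])
              |   b = np.array([9.8, 10.0, 9.7, 10.2, 9.9, 10.1, 9.6])
              | 
              |   def boot_ci(a, b, n_boot=10_000):
              |       diffs = np.empty(n_boot)
              |       for i in range(n_boot):
              |           diffs[i] = (rng.choice(b, b.size).mean()
              |                       - rng.choice(a, a.size).mean())
              |       return np.percentile(diffs, [2.5, 97.5])
              | 
              |   lo, hi = boot_ci(a, b)
              |   print(f"95% CI for mean(b) - mean(a): [{lo:.3f}, {hi:.3f}]")
              |   # If the interval excludes 0, the gap is unlikely to be
              |   # noise alone.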
        
               | stogot wrote:
               | Don't most computer science programs require this? Mine
               | had a statistics requirement
        
           | jiggawatts wrote:
           | "Here's the throughout at sustained 100% load with the same
           | ten sample queries repeated over and over."
           | 
           | "The customers want lower latency at 30% load for unique
           | queries."
           | 
           | "Err... we can scale up for more throughput!"
           | 
           | tth_tth
        
           | fragmede wrote:
            | In AI though, you also have the world trying to compete with
            | you. So even if you _do_ totally cheat, put the benchmark
            | answers in your training set, and overfit, it doesn't matter
            | how much your marketing department tells everyone you scored
            | 110% on SWE-bench: if your model doesn't work that well in
            | production, your announcement's going to flop as users
            | discover it doesn't work that well on their personal/internal
            | secret benchmarks and tell /r/localLLAMA it isn't worth the
            | download.
           | 
           | Whatever happened with Llama 4?
        
         | bofadeez wrote:
         | Has your lab tried using any of the newer causal inference-
         | style evaluation methods? Things like interventional or
         | counterfactual benchmarking, or causal graphs to tease apart
         | real reasoning gains from data or scale effects. Wondering if
         | that's something you've looked into yet, or if it's still too
         | experimental for practical benchmarking work.
        
         | j45 wrote:
         | What gets measured, gets managed and improved, though.
        
         | andy99 wrote:
          | I also work in LLM evaluation. My cynical take is that nobody
          | is really using LLMs for stuff, and so benchmarks are mostly
          | just made-up tasks (coding is probably the exception). If we
          | had real, specific use cases it would be easier to benchmark
          | and know if one is better, but it's mostly all hypothetical.
         | 
          | The more generous take is that you can't benchmark advanced
         | intelligence very well, whether LLM or person. We don't have
         | good procedures for assessing a person's fit-for-purpose e.g.
         | for a job, certainly not standardized question sets. Why would
         | we expect to be able to do this with AI?
         | 
         | I think both of these takes are present to some extent in
         | reality.
        
       | jennyholzer wrote:
       | I've been getting flagged by high-on-their-own-supply AI boosters
       | for identifying that LLM benchmarks have been obvious bullshit
       | for at least the last year and a half.
       | 
       | What changed to make "the inevitable AI bubble" the dominant
        | narrative in the last week or so?
        
         | conception wrote:
         | Companies are talking about needing trillions of dollars is
         | why.
        
           | foobiekr wrote:
           | And the government backstops.
           | 
           | Nothing says confidence that AGI is imminent like needing the
           | US government to prevent your investments from losing you
           | money.
        
         | HPsquared wrote:
         | Benchmarks in general have this problem, across pretty much all
         | industries. "When a measure becomes a target" and all that.
        
         | icameron wrote:
          | The market was down, especially for AI-related stocks. While
          | only down a bit over 3%, it's the worst week since April, and
          | there's no single event to blame; it just looks like market
          | sentiment has shifted away from the previous unchecked
          | exuberance.
        
         | Kiro wrote:
         | Link those comments please because I checked your history and
         | the flagged ones were pure nonsense with zero insights. Also,
          | calling out LLM benchmarks has never been a radical take and is
         | basically the default on this site.
        
         | purple_turtle wrote:
         | It is possible to be right on the main theme but only by
         | accident (with arguments and claims being wrong), communicating
         | in a highly faulty way, with pointless insults, doing it in
          | offtopic derails, being correct on a minor point while being
          | mostly wrong, etc.
         | 
         | Can you link some of these comments you consider useful but got
         | flagged?
        
       | shanev wrote:
       | This is solvable at the level of an individual developer. Write
       | your own benchmark for code problems that you've solved. Verify
       | tests pass and that it satisfies your metrics like tok/s and
       | TTFT. Create a harness that works with API keys or local models
       | (if you're going that route).
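        | 
        | The latency half of that harness can be very small if the
        | provider exposes an OpenAI-compatible streaming API. A rough
        | sketch (the model name is a placeholder, and counting one token
        | per streamed chunk is only an approximation):
        | 
        |   # Sketch: measure time-to-first-token (TTFT) and rough tok/s
        |   # for one prompt. Point base_url at a local server for local
        |   # models.
        |   import time
        |   from openai import OpenAI
        | 
        |   client = OpenAI()
        | 
        |   def bench(prompt, model="gpt-4o-mini"):  # placeholder model
        |       start = time.perf_counter()
        |       first = None
        |       chunks = 0
        |       stream = client.chat.completions.create(
        |           model=model,
        |           messages=[{"role": "user", "content": prompt}],
        |           stream=True,
        |       )
        |       for chunk in stream:
        |           if chunk.choices and chunk.choices[0].delta.content:
        |               if first is None:
        |                   first = time.perf_counter()
        |               chunks += 1
        |       total = time.perf_counter() - start
        |       ttft = (first - start) if first else float("nan")
        |       return {"ttft_s": ttft, "approx_tok_per_s": chunks / total}
        | 
        |   print(bench("Write a function that reverses a linked list."))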
        
         | motoboi wrote:
          | Well, OpenAI's evals repo on GitHub is open for contributed
          | evaluations. Just add yours there and it's guaranteed that the
          | next model will perform better on them.
        
         | cactusplant7374 wrote:
         | I think that's what this site is doing:
         | https://aistupidlevel.info/
        
         | davedx wrote:
         | That's called evals and yes any serious AI project uses them
        
         | hamdingers wrote:
         | At the developer level all my LLM use is in the context of
         | agentic wrappers, so my benchmark is fairly trivial:
         | 
         | Configure aider or claude code to use the new model, try to do
         | some work. The benchmark is pass/fail, if after a little while
         | I feel the performance is better than the last model I was
         | using it's a pass, otherwise it's a fail and I go back.
         | 
         | Building your own evaluations makes sense if you're serving an
         | LLM up to customers and want to know how it performs, but if
         | you are the user... use it and see how it goes. It's all
         | subjective anyway.
        
           | embedding-shape wrote:
           | > Building your own evaluations makes sense if you're serving
           | an LLM up to customers and want to know how it performs, but
           | if you are the user... use it and see how it goes. It's all
           | subjective anyway.
           | 
           | I'd really caution against this approach, mainly because
           | humans suck at removing emotions and other "human" factors
           | when judging how well something works, but also because
           | comparing across models gets a lot easier when you can see
           | 77/100 vs 91/100 as a percentage score, over your own tasks
           | that you actually use the LLMs for. Just don't share this
           | benchmark publicly once you're using it for measurements.
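            | 
            | The harness doesn't need to be fancy, either. Something in
            | this spirit gives a stable percentage over your own tasks (a
            | sketch only: ask_llm is a placeholder for whatever client you
            | already use, and the two tasks are toy examples):
            | 
            |   # Sketch: score a model over your own tasks as a percentage.
            |   def ask_llm(prompt: str) -> str:
            |       raise NotImplementedError("wire up your model here")
            | 
            |   TASKS = [
            |       ("What is 17 * 23? Reply with just the number.",
            |        lambda out: "391" in out),
            |       ("Name the capital of Australia in one word.",
            |        lambda out: "canberra" in out.lower()),
            |   ]
            | 
            |   def score(runs_per_task=3):
            |       passed = total = 0
            |       for prompt, check in TASKS:
            |           for _ in range(runs_per_task):  # catch inconsistency
            |               total += 1
            |               passed += bool(check(ask_llm(prompt)))
            |       return 100.0 * passed / total
            | 
            |   # print(f"{score():.0f}/100") once ask_llm() is wired up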
        
             | hamdingers wrote:
             | So what? I'm the one that's using it, I happen to be a
             | human, my human factor is the only one that matters.
             | 
                | At this point anyone using these LLMs every day has seen
                | those benchmark numbers go up without an appreciable
                | improvement in the day-to-day experience.
        
               | embedding-shape wrote:
               | > So what? I'm the one that's using it, I happen to be a
               | human, my human factor is the only one that matters.
               | 
                | Yeah no, you're right: if consistency isn't important to
                | you as a human, then it doesn't matter. Personally, I
                | don't trust my "humanness", and correctness is the most
                | important thing for me when working with LLMs, so that's
                | what my benchmarks focus on.
               | 
                | > At this point anyone using these LLMs every day has
                | seen those benchmark numbers go up without an appreciable
                | improvement in the day-to-day experience.
                | 
                | Yes, this is exactly my point. The benchmarks the makers
                | of these LLMs publish seem to always show better and
                | better scores, yet the top scores in my own benchmarks
                | have been more or less the same for the last 1.5 years,
                | and I try every LLM I can come across. The "best LLM to
                | date!" hardly ever actually is the best available LLM,
                | and while you could make that judgement by just playing
                | around with LLMs, actually being able to point to
                | specifically _why_ that is, is something at least I find
                | useful. YMMV.
        
         | j45 wrote:
         | We have to keep in mind that "solving" might mean having the
         | LLM recognize the pattern of solving something.
        
       | bee_rider wrote:
       | > "For example, if a benchmark reuses questions from a
       | calculator-free exam such as AIME," the study says, "numbers in
       | each problem will have been chosen to facilitate basic
       | arithmetic. Testing only on these problems would not predict
       | performance on larger numbers, where LLMs struggle."
       | 
       | When models figure out how to exploit an effect that every clever
       | college student does, that should count as a win. That's a much
       | more human-like reasoning ability, than the ability to multiply
       | large numbers or whatever (computers were already good at that,
       | to the point that it has become a useless skill for humans to
       | have). The point of these LLMs is to do things that computers
       | were bad at.
        
         | layer8 wrote:
         | I don't think the fact that LLMs can handle small numbers more
         | reliably has anything to do with their reasoning ability. To
         | the contrary, reasoning ability should enable them to handle
         | numbers of arbitrary size, just as it enables humans to do so,
         | given some pencil and paper.
         | 
         | However:
         | 
         | > Testing only on these problems would not predict performance
         | on larger numbers, where LLMs struggle.
         | 
         | Since performance on large numbers is not what these exams are
         | intended to test for, I don't see this as a counterargument,
         | unless the benchmarks are misrepresenting what is being tested
         | for.
        
           | zamadatix wrote:
           | Pencil and paper is just testing with tools enabled.
        
             | layer8 wrote:
             | You seem to be addressing an argument that wasn't made.
             | 
             | Personally, I'd say that such tool use is more akin to a
             | human using a calculator.
        
               | zamadatix wrote:
               | I'm not addressing an argument, just stating that's
               | already a form of LLM testing done today for people
               | wanting to look at the difference in results the same as
               | the human analogy.
        
               | layer8 wrote:
               | Okay, but then I don't understand why you replied to my
               | comment for that, there is no direct connection to what I
               | wrote, nor to what bee_rider wrote.
        
               | zamadatix wrote:
               | > To the contrary, reasoning ability should enable them
               | to handle numbers of arbitrary size, just as it enables
               | humans to do so, given some pencil and paper.
               | 
               | People interested can see the results of giving LLMs pen
               | and paper today by looking at benchmarks with tools
               | enabled. It's an addition to what you said, not an attack
               | on a portion of your comment :).
        
               | layer8 wrote:
               | I see now. My focus was on the effect of LLMs' (and by
               | analogy, humans') reasoning abilities argued by
               | bee_rider. The fact that tool use can enable more
               | reliable handling of large numbers has no bearing on
               | that, hence I found the reply confusing.
        
               | zamadatix wrote:
                | Hmm, maybe it depends on the specific test and the
                | reasoning in it? I certainly think reasoning about how
                | and when to use allowed tools, and when not to, is a big
                | part of the reasoning and verification process. E.g. most
                | human math exams allow for a pen and paper calculation,
                | or even a calculator, and that can be a great way to,
                | say, spot-check a symbolic derivative and see it needs to
                | be revisited without relying on the calculator/paper to
                | do the actual reasoning for the testee. Or to see the
                | equation of motion of a system can't possibly have been
                | right with some test values (without which I'm not sure
                | I'd have passed my mid-level physics course, haha).
               | 
               | At the very least, the scores for benchmarking a human on
               | such a test with and without tools would be different to
               | comparing an LLM without the analogous constraints. Which
               | is (IMO) a useful note in comparing reasoning abilities
               | and why I thought it was interesting to note this kind of
               | testing is just called testing with tools on the LLM side
               | (not sure there is an equally as standard term on the
               | human testing side? Guess the same could be used for both
               | though).
               | 
               | At the same time I'm sure other reasoning tests don't
               | gain much from/expect use of tools at all. So it wouldn't
               | be relevant for those reasoning tests.
        
             | LadyCailin wrote:
             | I'd say it's fair for LLMs to be able to use any tool in
             | benchmarks, so long as they are the ones to decide to use
             | them.
        
               | zamadatix wrote:
               | Agreed. I don't like when the prompt sets up a good
               | portion of how to go about finding the answer by saying
               | which tools to use and how. The LLM needs to decide when
               | and how to use them, not the prompt.
        
               | daveguy wrote:
               | I don't think it should be completely open ended. I mean,
               | you could have an "ask_hooman" tool that solves a ton of
               | problems with current LLMs. But that doesn't mean the LLM
               | is capable with respect to the benchmark.
        
               | vntok wrote:
               | Why not? One of the most intelligent things to do when
               | stuck on a problem is to get outside help.
               | 
               | If allowing this behaviour raises a problem, you can
               | always add constraints to the benchmark such as "final
               | answer must come out under 15s" or something. The LLM can
               | then make the decision to ask around in accordance to the
               | time risk.
        
               | daveguy wrote:
               | Because AI are good at devolving to the highest score,
               | regardless of test intent. For most problems
               | "ask_hooman", or especially the plural, would be much
               | more effective. So, the degenerate case would dominate
               | and tell you precisely zero about the intelligence of the
               | AI. If a specific "tool" is more adept than the "AI" then
               | "choose tool" will always be the correct answer. But I
               | agree, a tight time constraint would help.
        
             | Dylan16807 wrote:
             | On some level this makes sense, but on the other hand LLMs
             | already have perfect recall of thousands of symbols built
             | into them, which is what pencil and paper gives to a human
             | test taker.
        
               | zamadatix wrote:
               | If only context recall was actually perfect! The data is
               | certainly stored well, accurately accessing the right
               | part... maybe worse than a human :D.
        
               | Dylan16807 wrote:
               | If you're not doing clever hacks for very long windows, I
               | thought a basic design fed in the entire window and it's
               | up to the weights to use it properly.
        
           | luke0016 wrote:
           | > reasoning ability should enable them to handle numbers of
           | arbitrary size, just as it enables humans to do so, given
           | some pencil and paper.
           | 
           | Or given a calculator. Which it's running on. Which it in
              | some sense _is_. There's something deeply ironic about the
           | fact that we have an "AI" running on the most technologically
           | advanced calculator in the history of mankind and...it can't
           | do basic math.
        
             | halJordan wrote:
             | This is a very unserious take. It's not ironic, because
             | it's not a calculator.
        
               | rrr_oh_man wrote:
                | What's the meaning of `computer`, remind me quick?
        
               | anamexis wrote:
               | Computer vision algorithms run on computers and they
               | can't do basic arithmetic.
               | 
               | My email client runs on my computer and it doesn't do
               | basic arithmetic either.
               | 
               | Something running on a computer does not imply that it
               | can or should do basic arithmetic
        
               | TheOtherHobbes wrote:
                | That's conflating basic arithmetic as a user feature with
                | basic arithmetic as an implementation requirement.
               | 
               | I guarantee that computer vision and email clients both
               | use basic arithmetic in implementation. And it would be
               | trivially easy to bolt a calculator into an email app,
               | because the languages used to write email apps include
               | math features.
               | 
               | That's not true of LLMs. There's math at the bottom of
               | the stack. But LLMs run as a separate closed and opaque
               | application of a unique and self-contained type, which
               | isn't easily extensible.
               | 
               | They don't include hooks into math features on the GPUs,
               | and there's no easy way to add hooks.
               | 
               | If you want math, you need a separate tool call to
               | conventional code.
               | 
               | IMO testing LLMs as if they "should" be able to do
               | arithmetic is bizarre. They can't. They're not designed
               | to. And even if they did, they'd be ridiculously
               | inefficient at it.
        
               | anamexis wrote:
               | Yes, you are agreeing with me.
        
               | gishh wrote:
               | Pretty sure the only thing computer vision does is math.
               | 
               | I've also observed email clients tallying the number of
               | unread emails I have. It's quite obnoxious actually, but
               | I qualify adding as math.
        
               | anamexis wrote:
               | Yes, everything that a computer does, it does using math.
               | This does not imply that things running on the computer
               | can do basic arithmetic tasks for the user.
        
               | ghurtado wrote:
               | > Pretty sure the only thing computer vision does is
               | math.
               | 
               | That is only marginally less pedantic than saying that
               | the only thing computer vision does is run discrete
               | electrical signals through billions of transistors.
        
             | benjiro wrote:
              | Thing is, an LLM is nothing but a prediction algorithm based
              | upon what it was trained on. So it missing basic calculator
              | functionality is a given. This is why tool usage is more
              | and more a thing for LLMs: the LLM can itself use a
              | calculator for the actual math it needs, thus increasing
              | accuracy...
        
               | throwup238 wrote:
               | Why is it a given? The universal approximation theorem
               | should apply since addition is a continuous function. Now
               | whether the network is sufficiently trained for that is
               | another question but I don't think it's a given that a
               | trillion parameter model can't approximate the most basic
               | math operations.
               | 
               | I think the tokenization is a bigger problem than the
               | model itself.
        
               | benjiro wrote:
                | Easy to answer that one... predictions are based upon
                | accuracy. So if you have an int4 vs a float16, the chance
                | that the prediction goes off is higher with an int4. But
                | even with a float16, you're still going to run into
                | issues where your prediction model goes off. It's going
                | to happen a lot less, but you're still going to get
                | rounding issues, which may result in a 5 becoming an 8
                | (just an example).
                | 
                | So while it can look like an LLM calculates correctly,
                | it's still restricted by this accuracy issue. And when
                | you get a single number wrong in a calculation,
                | everything is wrong.
                | 
                | A calculator, on the other hand, does not deal with
                | predictions but with basic adding/multiplying/subtracting,
                | etc... things that are 100% accurate (if we do not count
                | issues like cosmic ray hits, failures in the silicon,
                | etc.).
                | 
                | A trillion-parameter model is just that, a trillion
                | parameters, but what matters is not the tokens but the
                | accuracy, as in: do they use int, float16, float32,
                | float64... The issue is, the higher we go, the more the
                | memory usage explodes.
                | 
                | There is no point in spending terabytes of memory just to
                | get a somewhat accurate predictive calculator, when we
                | can have the LLM call an actual calculator to ensure its
                | results are accurate.
                | 
                | Think of an LLM more like somebody with dyslexia /
                | dyscalculia... It does not matter how good you are, all
                | it takes is switching one number in an algebraic
                | calculation to get a 0/10... The reason I mention this is
                | that I often think of an LLM like a person with dyslexia
                | / dyscalculia. It can have insane knowledge, be smart,
                | but be considered dumb by society because of that
                | less-than-accurate prediction (or number-swapping issue).
                | 
                | Take it from somebody who wasted a few years in school
                | thanks to that issue: it really does not matter if you're
                | a good programmer later in life, when you flunk a few
                | years thanks to undiagnosed issues. And yet, just like an
                | LLM, I simply rely on tool usage to fix my inaccuracy
                | issues. No point in wasting good shoulder space trying to
                | graft a dozen more heads/brains onto me, when I can
                | simply delegate the issue away. ;)
                | 
                | The fact that we can get computer models that can almost
                | program, write texts... and do so much more, like a
                | slightly malfunctioning human, amazes me. And at the same
                | time, I curse at it like my teachers did, and also call
                | it dumb at times hehehe... I now understand how my
                | teachers felt _loool_
        
               | DrewADesign wrote:
               | If they were selling LLMs as "LLMs" instead of magic
               | code-writing, answer-giving PhD replacements, the lack of
               | basic arithmetic capability _would_ be a _given_ ... but
               | they _aren't._ Judging a paid service using their own
               | implied claims is perfectly reasonable.
        
             | novok wrote:
             | This is like saying it's ironic that an alternator in a car
             | cannot combust gasoline when the gasoline engine is right
             | beside it, even though the alternator 'runs' on the
             | gasoline engine.
        
               | luke0016 wrote:
               | Or similarly having a gasoline engine without an
               | alternator and making the observation that there's an
               | absurdity there in that you're generating large amounts
               | of energy, yet aren't able to charge a relatively small
               | 12V battery with any of it. It's a very practical and
               | natural limitation, yet in some sense you have exactly
               | what you want - energy - you just can't use it because of
               | the form. If you step back there's an amusing irony
               | buried in that. At least in my humble opinion :-)
        
           | ambicapter wrote:
           | > Since performance on large numbers is not what these exams
           | are intended to test for,
           | 
           | How so? Isn't the point of these exams to test arithmetic
           | skills? I would hope we'd like arithmetic skills to be at a
           | constant level regardless of the size of the number?
        
             | singron wrote:
             | No. AIME is a test for advanced high schoolers that mostly
             | tests higher level math concepts like algebra and
             | combinatorics. The arithmetic required is basic. All the
             | answers are 3-digit numbers so that judging is objective
             | and automated while making guessing infeasible. You have 12
             | minutes on average for each question, so even if you are
             | terribly slow at arithmetic, you should still be able to
             | calculate the correct answer if you can perform all the
             | other math.
        
               | ambicapter wrote:
               | That's probably a great test for high schoolers but it
               | doesn't really test what we want from AI, no? I would
               | expect AI to be limited by the far greater constraints of
               | its computing ability, and not the working memory of a
               | human high schooler.
        
         | gardnr wrote:
         | A discussion on models "figuring out" things:
         | https://www.youtube.com/watch?v=Xx4Tpsk_fnM (Forbidden
         | Technique)
        
         | BolexNOLA wrote:
         | >the point of these LLMs is to do things that computers were
         | bad at.
         | 
         | The way they're being deployed it feels like the point of LLMs
         | is largely to replace basic online search or to run your online
         | customer support cheaply.
         | 
         | I'm a bit out on a limb here because this is not really my
         | technical expertise by any stretch of the imagination, but it
         | seems to me these benchmark tests don't really tell us much
            | about how LLMs perform in the ways most people actually use
            | them. Maybe I'm off base here, though.
        
           | riskable wrote:
           | Nobody really knows "the point" of LLMs yet. They weren't
           | even "invented" as much as they _emerged_ as a trick to get
           | computers to better understand human language.
           | 
           | They're still brand spanking new and everyone's trying to
           | figure out how to best use them. We don't even really know if
           | they're _ever_ going to be  "really good at" any given task!
           | 
           | Are they "really good at" these things or are they merely
           | "OK-ish"?                   * Answering factual questions.
           | * Programming.         * Understanding what the user wants
           | from natural language.          * Searching/recommending
           | stuff.
           | 
           | Real world testing suggests that with billions and billions
           | of dollars spent, you really _can_ get an LLM to be  "OK-ish"
           | at all those things :D
        
         | nradov wrote:
         | LLMs can probably be taught or configured to use external tools
         | like Excel or Mathematica when such calculations are needed.
         | Just like humans. There are plenty of untapped optimization
         | opportunities.
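          | 
          | Most of the big APIs already support this via function
          | calling. A rough sketch of the pattern (tool schema and model
          | name are illustrative; error handling omitted):
          | 
          |   # Sketch: expose a calculator as a tool the model can call.
          |   import json
          |   from openai import OpenAI
          | 
          |   client = OpenAI()
          |   tools = [{
          |       "type": "function",
          |       "function": {
          |           "name": "multiply",
          |           "description": "Multiply two integers exactly.",
          |           "parameters": {
          |               "type": "object",
          |               "properties": {"a": {"type": "integer"},
          |                              "b": {"type": "integer"}},
          |               "required": ["a", "b"],
          |           },
          |       },
          |   }]
          | 
          |   resp = client.chat.completions.create(
          |       model="gpt-4o-mini",  # placeholder
          |       messages=[{"role": "user",
          |                  "content": "What is 123456 * 654321?"}],
          |       tools=tools,
          |   )
          |   call = resp.choices[0].message.tool_calls[0]
          |   args = json.loads(call.function.arguments)
          |   print(args["a"] * args["b"])  # exact product, outside the model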
        
           | davedx wrote:
            | ChatGPT spins up a Python sandbox for any complex
            | calculations. It's been able to do that for a while now.
        
         | 6510 wrote:
          | I don't claim to know anything, but I thought tool usage was
          | a major sign of intelligence. For example, floats are a
          | wonderful technology, but people use them as if chainsaws
          | were great for cutting bread and butter. We now have entire
          | languages that can't do basic arithmetic. I thought it was
          | alarming: _People, it can't compute like this!_ Now we have
          | language models, which are still computers, so why can't we
          | just give them... you know... calculators? Arguably the best
          | thing their universe has to offer.
          | 
          | edit: I forgot my point: calculating big numbers is not a
          | real-world problem anyone has.
        
           | yunyu wrote:
           | We do? Tool use started coming in vogue around 2023
        
             | riskable wrote:
             | Actually, tool use started coming into vogue around 3.3
             | million years ago.
        
         | novok wrote:
         | IMO I think the calculator problem goes away with tool use or
         | NN architectures that basically add a calculator equivalent as
         | one of the potential 'experts' or similar. It won't be much of
          | a trope for much longer.
        
           | davedx wrote:
            | ChatGPT has been calculating things in its Python sandbox for
            | years already. This is a trope indeed.
        
         | jvanderbot wrote:
         | Absolutely not.
         | 
         | College exam takers use those tricks because they are on a time
         | limit and are gaming the system. It's clever and wink wink
         | nudge nudge ok everyone does it. But it's one tiny signal in a
         | huge spectrum of things we use to evaluate _people_.
         | 
         | Instead, these metrics are gamed and presented as the _entire_
          | multi-spectral signal of competence for LLMs, because it is
         | literally impossible to say that success in one domain would
         | translate the way it might with a good hire.
         | 
         | What I want is something I don't have to guard against gaming.
          | Something conscientious and capable like my co-workers. Until
          | then it's Google version 2 married to IntelliSense, and I'm not
          | letting it do anything by itself.
        
         | joe_the_user wrote:
         | _The point of these LLMs is to do things that computers were
         | bad at._
         | 
         | That's a good point imo but we achieved this stuff by at least
         | 2022 when ChatGPT was released. The thing about these giant
         | black boxes is that they also fail to do things that directly
         | human-written software ("computers") does easily. The inability
         | to print text onto generated images or do general arithmetic is
         | important. And sure, some of these limits look like "limits of
         | humans". But it is important to avoid jumping from "they do
         | this human-thing" to "they're like humans".
        
       | moritzwarhier wrote:
       | When people claim that there is such a thing as "X% accuracy in
       | reasoning", it's really hard to take anything else seriously, no
       | matter how impressive.
       | 
       | AI (and humans!) aside, claiming that there was an oracle that
       | could "answer all questions" is a solved problem. Such a thing
       | cannot exist.
       | 
       | But this is going already too deep IMO.
       | 
       | When people start talking about percentages or benchmark scores,
       | there has to be some denominator.
       | 
       | And there can be no bias-free such denominator for
       | 
       | - trivia questions
       | 
       | - mathematical questions (oh, maybe I'm wrong here, intuitively
       | I'd say it's impossible for various reasons: varying "hardness",
       | undecidable problems etc)
       | 
        | - historical or political questions
       | 
       | I wanted to include "software development tasks", but it would be
       | a distraction. Maybe there will be a good benchmark for this, I'm
       | aware there are plenty already. Maybe AI will be capable to be a
       | better software developer than me in some capacity, so I don't
       | want to include this part here. That also maps pretty well to
       | "the better the problem description, the better the output",
       | which doesn't seem to work so neatly with the other categories of
       | tasks and questions.
       | 
       | Even if the whole body of questions/tasks/prompts would be very
       | constrained and cover only a single domain, it seems impossible
       | to guarantee that such benchmark is "bias-free" (I know AGI folks
       | love this word).
       | 
       | Maybe in some interesting special cases? For example, very
       | constrained and clearly defined classes of questions, at which
       | point, the "language" part of LLMs seems to become less important
       | and more of a distraction. Sure, AI is not just LLMs, and LLMs
       | are not just assistants, and Neural Networks are not just LLMs...
       | 
        | That's where the problem begins, to be honest: I don't even know
        | how to align the "benchmark" claims with the kind of AI they are
        | examining and the ones I know exist.
       | 
       | Sure it's possible to benchmark how well an AI decides whether,
       | for example, a picture shows a rabbit. Even then: for some
       | pictures, it's gotta be undecidable, no matter how good the
       | training data is?
       | 
       | I'm just a complete layman and commenting about this; I'm not
       | even fluent in the absolute basics of artificial neural networks
       | like perceptrons, gradient descent, backpropagation and typical
       | non-LLM CNNs that are used today, GANs etc.
       | 
       | I am and was impressed by AI and deep learning, but to this day I
        | am thoroughly disappointed by the hubris of snake-oil salespeople
       | who think it's valuable and meaningful to "benchmark" machines on
       | "general reasoning".
       | 
       | I mean, it's already a thing in humans. There are IQ tests for
       | the non-trivia parts. And even these have plenty of discussion
       | revolving around them, for good reason.
       | 
       | Is there some "AI benchmark" that exclusively focuses on doing
       | recent IQ tests on models, preferably editions that were
       | published after the particular knowledge cutoff of the respective
       | models? I found (for example) this study [1], but to be honest,
       | I'm not the kind of person who is able to get the core insights
       | presented in such a paper by skimming through it.
       | 
        | Because I think there _are_ impressive results, it's just
        | becoming very hard to see through the bullshit as an average
        | person.
       | 
        | I would also love to understand more about the current state of
       | the research on the "LLMs as compression" topic [2][3].
       | 
       | [1] https://arxiv.org/pdf/2507.20208
       | 
       | [2] https://www.mattmahoney.net/dc/text.html
       | 
       | [3] https://arxiv.org/abs/2410.21352
        
       | SurceBeats wrote:
       | Benchmarks optimize for fundraising, not users. The gap between
       | "state of the art" and "previous gen" keeps shrinking in real-
       | world use, but investors still write checks based on decimal
       | points in test scores.
        
         | yawnxyz wrote:
         | we try to make benchmarks for users, but it's like that 20%
         | article - different people want different 20% and you just end
         | up adding "features" and whackamoling the different kinds of
         | 20%
         | 
         | if a single benchmark could be a universal truth, and it was
         | easy to figure out how to do it, everyone would love that.. but
         | that's why we're in the state we're in right now
        
           | DrewADesign wrote:
           | The problem isn't with the benchmarks (or the models, for
           | that matter) it's their being used to prop up the
           | indefensible _product_ marketing claims made by people
           | frantically justifying asking for more dump trucks of
           | thousand-dollar bills to replace the ones they just burned
           | through in a few months.
        
       | RA_Fisher wrote:
       | For statistical AI models, we can use out of sample prediction
       | error as an objective measure to compare models. What makes
       | evaluating LLMs difficult is that comparisons are inextricable
       | from utility (whereas statistical AI models do have a pre-utility
       | step wherein it can be shown out of sample prediction epsilon is
       | minimized).
        
       | SkyPuncher wrote:
       | Benchmarks are nothing more than highly contextual specs (in
       | traditional code). They demonstrate your code works in a certain
       | way in certain use cases, but they do not prove your code works
       | as expected in all use cases.
        
         | embedding-shape wrote:
         | > Program testing can be used to show the presence of bugs, but
         | never to show their absence. Edsger W. Dijkstra
         | 
         | Maybe we need something similar for benchmarks, and updated for
         | today's LLMs, like:
         | 
         | > LLM benchmarks can be used to show what tasks they can do,
         | but never to show what tasks they cannot.
        
       | pahae wrote:
       | I wish the big providers would offer some sort of trial period
       | where you can evaluate models in a _realistic_ setting yourself
       | (i.e cli tools or IDE integrations). I wouldn't even mind strict
       | limits -- just give me two hours or so of usage and I'd already
       | be happy. Seriously.
       | 
       | My use-case is probably pretty far from the usual tasks: I'm
       | currently implementing a full observability platform based on
       | VictoriaMetrics / Victorialogs + Grafana. It's quite elaborate
       | and has practically no overlap with the usual/cloud solutions you
       | find out there. For example, it uses an authenticated query
       | stack: I use the Grafana oauth token to authenticate queries by
       | injecting matchers via prom-label-proxy and forward that to
       | promxy for fan-out to different datasources (using the label
       | filter to only query some datasources). The IaC stuff is also not
       | mainstream as I'm not using any of the big cloud providers, but
       | the provider I use nonetheless has a terraform provider.
       | 
       | As you can imagine, there's probably not much training data for
       | most of this, so quality of the responses varies widely. From my
        | experience so far, Claude (Sonnet 4.5) does a _much_ better job
        | than GPT-5 (Codex or normal) with the day-to-day tasks. Stuff like
       | keeping documentation up to date, spotting inconsistencies,
       | helping me find blind spots in the Alerting rules, etc. It also
       | seems to do better working with provided documentation / links.
       | 
       | I've been using Claude for a couple of weeks now but recently
       | switched to codex after my subscription to Claude ran out. I was
       | really curious after reading a lot of good things about it but I
       | gotta say, so far, I'm not impressed. Compared to Claude it gives
       | wrong answers much more frequently (at least in this domain). The
       | results it produces take much more effort to clean up than
       | Claude's. Probably on a level where I could just invest the time
       | myself. Might be that I do not yet know how to correctly prompt
       | GPT but giving both tools the same prompt, Claude does a better
       | job 90% of the time.
       | 
       | Anyway, I guess this is my long-winded way of saying that the
       | quality of responses "off the beaten track" varies widely and is
       | worth testing several models with. Especially if your work is not
        | 70+% coding. Even then, I guess that many benchmarks have
        | ceased being useful by now?
        
         | cavisne wrote:
         | https://wuu73.org/blog/aiguide1.html
         | 
         | You can get a lot of free usage out of the models.
        
         | tim333 wrote:
          | There's the GitHub Copilot 30-day trial? "Access to Anthropic
          | Claude Sonnet 4, GPT-5, Gemini 2.5 Pro, and more. 300 premium
          | requests to use the latest models and code review"
        
       | bbor wrote:
       | I'm already quite put off by the title (it's science -- if you
       | have a better benchmark, publish it!), but the contents aren't
       | great either. It keeps citing numbers about "445 LLM benchmarks"
       | without confirming whether any of the ones they deem
       | insufficiently statistical are used by any of the major players.
        | I've seen a lot of benchmarks, but _maybe_ 20 are used regularly
        | by large labs, max.
        | 
        |   "For example, if a benchmark reuses questions from a
        |   calculator-free exam such as AIME," the study says, "numbers in
        |   each problem will have been chosen to facilitate basic
        |   arithmetic. Testing only on these problems would not predict
        |   performance on larger numbers, where LLMs struggle."
       | 
        | For a math-based critique, this seems to ignore a glaring
        | problem: is it even _possible_ to randomly sample all natural
        | numbers? As another comment pointed out, we wouldn't even want to
        | ("LLMs can't accurately multiply 6-digit numbers" isn't something
        | anyone cares about/expected them to do in the first place), but
        | regardless: this seems like a vacuous critique dressed up in a
        | costume of mathematical rigor.
        | 
        |   At least some of those who design benchmark tests are aware of
        |   these concerns.
       | 
       | In related news, at least some scientists studying climate change
       | are aware that their methods are imperfect. More at 11!
       | 
       | If anyone doubts my concerns and thinks this article is in good
       | faith, just check out this site's "AI+ML" section:
       | https://www.theregister.com/software/ai_ml/
        
         | daveguy wrote:
         | The article references this review:
         | 
         | https://openreview.net/pdf?id=mdA5lVvNcU
         | 
         | And the review is pretty damning regarding statistical validity
         | of LLM benchmarks.
        
         | dang wrote:
         | (We've since changed both title and URL - see
         | https://news.ycombinator.com/item?id=45860056)
        
       | wolttam wrote:
       | I'd like to see some video generation benchmarks. For example,
       | one that tested a model's ability to generate POV footage of a
       | humanoid form carrying out typical household tasks
       | 
       | Even if it requires human evaluators at first, and even if the
       | models _completely suck_ at this task right now: it seems like
        | the kind of task you'd want them to be good at, if you want
       | these models to eventually carry out these tasks in embodied
       | forms in the real world.
       | 
       | Just having the benchmark in the first place is what gives model
       | makers something to optimize for.
        
         | luckydata wrote:
          | Generating footage wouldn't help with the opposite, but
          | navigating a simulation would, which is a pretty standard type
         | of evaluation for multimodal AIs designed to act in the real
         | world.
        
           | wolttam wrote:
           | Do you mean that it wouldn't help with ingesting footage and
           | then determining how to act?
           | 
           | I can imagine a robotics architecture where you have one
           | model generating footage (next frames for what it is
           | currently seeing) and another dumber model which takes in the
           | generated footage and only knows how to generate the
           | motor/servo control outputs needed to control whatever robot
           | platform it is integrated with.
           | 
           | I think that kind of architecture decoupling would be nice.
           | It allows the model with all the world and task-specific
           | knowledge to be agnostic from its underlying robot platform.
        
       | lysace wrote:
       | Tech companies/bloggers/press/etc are perpetually bad at
       | benchmarks. For browsers they kept pushing simplistic javascript-
       | centric benchmarks even when it was clear for at least 15 years
       | that layout/paint/network/etc were the dominant bottlenecks in
       | real-world usage.
       | 
       | It's primarily marketing-driven. I think the technical side of
       | these companies needs to take more ownership of this.
       | 
       | It gets really weird when engineering priorities shift because of
       | these mostly irrelevant benchmarks.
        
       | doctorpangloss wrote:
       | The problem with the LLM benchmarks is that if you see one that
       | shows high performance by something that isn't from Anthropic,
       | Google or OpenAI, you don't believe it, even if it were "true."
       | In that sense, benchmarks in this domain are more a holistic
       | social experience than a scientific endeavour.
        
       | qustrolabe wrote:
       | Technically true but also a very dumb take and manipulative
       | phrasing
        
       | riskable wrote:
       | We should make a collective git repo full of every kind of
       | annoying bug we (expert developers) can think of. Then use _that_
       | to benchmark LLMs.
       | 
       | Someone want to start? I've got a Yjs/CRDT collaborative editing
       | bug that took like a week and a half of many, many attempts with
       | Claude Code (Sonnet 4.5), GPT5-codex (medium), and GLM-4.6 to
       | figure out. Even _then_ they didn't really get it...
       | Just came up with a successful workaround (which is good enough
       | for me but still...).
       | 
       | Aside: You know what _really_ moved the progress bar on finding
       | and fixing the bug? When I had a moment of inspiration and made
       | the frontend send all its logs to the backend so the AIs could
       | see what was actually happening on the frontend (near real-time).
       | Really, I was just getting sick of manual testing and pasting the
       | console output into the chat (LOL). Laziness FTW!
       | 
       | I have the Google Chrome Dev Tools MCP but for some reason it
       | doesn't work as well :shrug:
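       | 
       | If anyone wants to copy the log-forwarding trick, here's a
       | minimal sketch of the backend half (assuming a Flask app; the
       | route name and payload shape are made up, and the frontend just
       | POSTs its console output as JSON), so the agent can tail one
       | file instead of you pasting console dumps:
       | 
       |     # log_sink.py - collect frontend console logs for the agent
       |     import json, time
       |     from flask import Flask, request
       | 
       |     app = Flask(__name__)
       |     LOG_PATH = "frontend.log"  # the coding agent tails this file
       | 
       |     @app.post("/client-logs")
       |     def client_logs():
       |         entry = request.get_json(force=True)  # e.g. {"level": ..., "msg": ...}
       |         with open(LOG_PATH, "a") as f:
       |             f.write(json.dumps({"ts": time.time(), **entry}) + "\n")
       |         return {"ok": True}
       | 
       |     if __name__ == "__main__":
       |         app.run(port=5001)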
        
         | simonw wrote:
         | Have you tried the Playwright libraries? Not the MCP, instead
         | telling Claude Code to use the Node.js or Python Playwright
          | libraries directly. I have had some really good results with
          | this for gnarly frontend challenges.
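          | 
          | For anyone curious what that looks like in practice, a minimal
          | sketch with the Python sync API (the URL and selector are
          | placeholders) - the agent can write and rerun scripts like
          | this on its own:
          | 
          |     # repro.py - let the agent drive the page and read the console
          |     from playwright.sync_api import sync_playwright
          | 
          |     with sync_playwright() as p:
          |         browser = p.chromium.launch()
          |         page = browser.new_page()
          |         page.on("console", lambda msg: print("console:", msg.text))
          |         page.goto("http://localhost:3000")  # placeholder dev server
          |         page.click("text=Edit")             # placeholder interaction
          |         page.wait_for_timeout(1000)
          |         browser.close()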
        
           | kvirani wrote:
           | Curious why not the MCP? I use that
        
             | s900mhz wrote:
              | When I have a bug I'm iterating on, it's much easier and
              | faster to have the agent write out the Playwright script.
              | That way
             | it does not have to waste time or tokens performing the
             | same actions over and over again.
             | 
             | Think of it as TDD.
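              | 
              | e.g. once the repro is captured as a test (a sketch
              | assuming the pytest-playwright plugin, which provides the
              | page fixture; the URL and selectors are placeholders),
              | re-running it costs one command and zero tokens:
              | 
              |     # test_collab_bug.py - run with: pytest test_collab_bug.py
              |     def test_edit_survives_reload(page):
              |         page.goto("http://localhost:3000/doc/demo")  # placeholder
              |         page.fill("#editor", "hello")                # placeholder
              |         page.reload()
              |         assert page.input_value("#editor") == "hello"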
        
             | simonw wrote:
             | I don't really like MCPs, at least when I'm working with
             | coding agents like Claude Code or Codex CLI. I'd rather let
             | the agents write code that can do anything the underlying
             | library is capable of, rather than restricting them to just
             | the functionality that the MCP exposes.
             | 
             | It's more token efficient too since I don't need to load
             | the full MCP description into my context.
        
         | cortesoft wrote:
          | It would be pretty easy to overfit the results with a static
          | set of tests.
        
         | bogwog wrote:
         | I can't tell how much of this is sarcasm
         | 
         | > we (expert developers) ...
         | 
         | > took like a week and a half of attempts with Claude Code ...
         | 
         | What kind of expert developer wastes that much time prompting a
         | bunch of different LLMs to end up with a workaround, instead of
         | actually debugging and fixing the bug themselves?
        
           | kvirani wrote:
           | Fair question but I think the tone of this is a bit abrasive
           | towards the poster, and unnecessarily so.
        
             | topato wrote:
              | There is a lot of disdain for vibe coding/coders, as I'm
              | sure you already know. I was going to post something
              | similar as soon as I read "a week and a half of prompts."
              | I pray that any gainfully employed expert coders don't
              | spend 10 days prompting rather than coding, lol.
        
             | ambicapter wrote:
             | I really don't think so. "Expert" developer really needs to
             | mean something other than "prompting and poking at Claude
             | Code".
        
         | atn34 wrote:
         | I actually started a collection of annoying bugs I've seen in
         | the wild. I give the llm the buggy implementation and ask it to
         | write a test that catches it. So far not even a frontier model
          | (Claude Sonnet) can do it, even though it can find and fix
         | the bug itself.
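          | 
          | For reference, the harness for that can be tiny (the file
          | layout and names here are made up, not a real tool): accept
          | the model-written test only if it fails against the buggy
          | implementation and passes against the fixed one.
          | 
          |     # grade_test.py - run as: python grade_test.py generated_test.py
          |     import shutil, subprocess, sys
          | 
          |     def run_against(impl_path: str, test_path: str) -> bool:
          |         """Return True if pytest passes with impl_path as impl.py."""
          |         shutil.copy(impl_path, "impl.py")
          |         result = subprocess.run(["pytest", "-q", test_path])
          |         return result.returncode == 0
          | 
          |     if __name__ == "__main__":
          |         test = sys.argv[1]
          |         fails_on_buggy = not run_against("impl_buggy.py", test)
          |         passes_on_fixed = run_against("impl_fixed.py", test)
          |         print("catches the bug" if fails_on_buggy and passes_on_fixed
          |               else "does not catch the bug")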
        
           | embedding-shape wrote:
           | > even a frontier model (Claude Sonnet) can do it
           | 
            | Probably because Sonnet is no longer a frontier model; it
            | isn't even the best model Anthropic offers, according to
            | Anthropic themselves.
        
         | embedding-shape wrote:
         | > We should make a collective git repo full of every kind of
         | annoying bug we (expert developers) can think of. Then use that
         | to benchmark LLMs.
         | 
          | I think any LLM user worth their salt has been doing this
          | pretty much since we got API access to LLMs, as otherwise there
         | is no way to actually see if they can solve the things you care
         | about.
         | 
          | The only difference is that you must keep the actual benchmarks
          | to yourself: don't share them with anyone, much less post them
          | publicly. The second you do, you should probably stop using
          | them as actual benchmarks, as newly trained LLMs will either
          | intentionally or unintentionally slurp up your benchmark and
          | suddenly it's no longer a good indicator.
         | 
         | I think I personally started keeping my own test cases for
          | benchmarking around the GPT-3 launch, when it became clear the
          | web would be effectively "poisoned" from that point on, and
          | that anything on the public internet could be slurped up by the
          | people feeding the LLMs training data.
         | 
         | Once you have this up and running, you'll get a much more
         | measured view of how well new LLMs work, and you'll quickly see
          | that a lot of the fanfare doesn't actually hold up when tested
          | against your own private benchmarks. On a happier note,
          | you'll also be surprised when a model suddenly does a lot
          | better in a specific area that wasn't even mentioned at
          | release, and then you can switch to it for that specific
          | task :)
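          | 
          | The harness doesn't need to be fancy, either. A sketch,
          | assuming the openai Python package against an OpenAI-
          | compatible API and a private prompts.jsonl of {"prompt": ...,
          | "expect": ...} lines (a substring check is crude, but enough
          | to spot regressions between models):
          | 
          |     # private_eval.py - never publish prompts.jsonl
          |     import json, sys
          |     from openai import OpenAI
          | 
          |     client = OpenAI()  # reads OPENAI_API_KEY from the environment
          |     model = sys.argv[1]  # e.g. "gpt-4o"
          | 
          |     passed = total = 0
          |     for line in open("prompts.jsonl"):
          |         case = json.loads(line)
          |         resp = client.chat.completions.create(
          |             model=model,
          |             messages=[{"role": "user", "content": case["prompt"]}],
          |         )
          |         answer = resp.choices[0].message.content or ""
          |         passed += case["expect"] in answer  # crude substring check
          |         total += 1
          | 
          |     print(f"{model}: {passed}/{total}")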
        
         | throw10920 wrote:
          | This may be intentional, but I'd like to point out that you're
          | basically suggesting that others aggregate high-quality
         | training data for AI companies to use free of charge to replace
         | software engineers.
        
         | n_u wrote:
         | What was the CRDT bug?
        
       | jstummbillig wrote:
       | Benchmarks are like SAT scores. Can they guarantee you'll be
       | great at your future job? No, but we are still roughly okay with
       | what they signify. Clearly LLMs are getting better in meaningful
        | ways, and benchmarks correlate with that to some extent.
        
         | zeroonetwothree wrote:
          | There's no a priori reason to expect that a test designed to
          | measure human academic performance would be a good one for
          | testing LLM job performance.
         | 
         | For example a test of "multiply 1765x9392" would have some
         | correlation with human intelligence but it wouldn't make sense
         | to apply it to computers.
        
           | sroussey wrote:
           | Actually... ask gpt1 to multiply 1765x9392.
        
             | SV_BubbleTime wrote:
              | I wish this were more broadly explained to people...
              | 
              | There are LLMs, the engines that make these products run,
              | and then the products themselves.
              | 
              | GPT-anything should not be asked math problems. LLMs are
              | language models, not math engines.
              | 
              | The line is going to get very blurry because ChatGPT,
              | Claude, and Gemini are not LLMs. They're products driven
              | by LLMs.
              | 
              | The question should not be "can my LLM do math?" It's "can
              | I build an LLM-driven product that can reason through math
              | problems?" Those are different things.
             | 
             | A coworker of mine told me that GPT's LLM can use Excel
             | files. No, it can't. But the tools they plugged into it
             | can.
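              | 
              | Concretely, the product-level fix is to route arithmetic
              | to a tool instead of to the weights. A toy sketch (not any
              | vendor's actual tool-calling API, just the shape of the
              | idea):
              | 
              |     import re
              | 
              |     def multiply_tool(a: int, b: int) -> int:
              |         return a * b  # exact, unlike sampling tokens
              | 
              |     def answer(question: str) -> str:
              |         # A real product would let the LLM decide to call the
              |         # tool; here a regex stands in for that routing step.
              |         m = re.search(r"(\d+)\s*[x*]\s*(\d+)", question)
              |         if m:
              |             a, b = int(m.group(1)), int(m.group(2))
              |             return f"{a} x {b} = {multiply_tool(a, b)}"
              |         return "(hand the question to the LLM)"
              | 
              |     print(answer("multiply 1765x9392"))  # 1765 x 9392 = 16576880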
        
         | SV_BubbleTime wrote:
         | Isn't this like grading art critics?
         | 
          | We took objective computers and made them generate subjective
          | results. Isn't this a problem that we already know there's no
          | solution to?
          | 
          | Grading subjectivity is itself subjective.
        
         | pessimizer wrote:
         | People often use "clearly" or "obviously" to elide the subject
         | that is under discussion. People are saying that they do not
         | think that it is clear that LLMs are getting better in
         | meaningful ways, and they are saying that the benchmarks are
         | problematic. "Clearly" isn't a counterargument.
        
       | AbrahamParangi wrote:
       | A test doesn't need to be objectively meaningful or rigorous in
       | any sense in order to still be useful for comparative ranking.
        
         | hobs wrote:
          | Yes it does - it has to be meaningful or rigorous for the
          | comparative ranking to be meaningful or rigorous, or else wtf
          | are you doing? Say I have all the information on my side but
          | only show the user these questions - who cares about that
          | comparison?
        
           | JumpCrisscross wrote:
           | Yeah, "I ordered all the cars in the world from least to most
           | blue" produces a comparative ranking. It's just not a useful
           | one.
        
       | instagraham wrote:
       | I've written about Humanity's Last Exam, which crowdsources tough
       | questions for AI models from domain experts around the world.
       | 
       | https://www.happiesthealth.com/articles/future-of-health/hum...
       | 
       | It's a shifting goalpost, but one of the things that struck me
       | was how some questions could still be trivial for a fairly
       | qualified human (a doctor in this case) but difficult for an AI
       | model. Reasoning, whether visual or logical, is built on a set
       | of assumptions that are better gained through IRL experience
       | than by crawling datasets and matching answers.
       | 
       | This leads me to believe that much of the future for training AI
       | models will lie in exposing them to "meatspace" and annotating
       | their inferences, much like how we train a child. This is a long,
       | long process, and one that is already underway at scale. But it's
       | what might give us emergent intelligences rather than just a
       | basket of competing yet somehow-magic thesauruses.
        
         | sroussey wrote:
          | Mercor is doing nine-digit annual revenue doing just that.
          | Micro1 and others are too.
        
       | zeroonetwothree wrote:
       | Humans are much better at out-of-sample prediction than LLMs.
       | And benchmarks, once published and trained on, inherently cannot
       | stay out of sample. So I believe that leads to the disconnect
       | between LLMs getting better and better at in-sample prediction
       | (benchmarks) while not improving nearly as much at out-of-sample
       | prediction (actual work).
        
       | twilightzone wrote:
       | "Measuring money turns out to be easier than measuring
       | intelligence." Don't ever change, El Reg.
        
       | dehrmann wrote:
       | This might explain the zeitgeist that new models feel same-ish,
       | despite model developers saying they're getting spectacularly
       | better.
        
       | inavida wrote:
       | They should laugh while they can ;) Still waiting for the crash
       | and to see what lives on and what gets recycled. My bet is that
       | grok is here to stay ;)
       | 
       | (Don't hurt me, I just like his chatbot. It's the best I've tried
       | at, "Find the passage in X that reminded me of the passage in Y
       | given this, that, and the other thing." It has a tendency to blow
       | smoke if you let it, and they all seek to affirm more than I'd
       | like - but ain't that the modern world? It can also be hilariously
       | funny in surprisingly apt ways.)
        
         | typpilol wrote:
         | Grok is terrible at coding though.
        
           | JumpCrisscross wrote:
           | If models get commoditised, distribution (and vertical
           | integration) become key. OpenAI and xAI are the only
           | companies that seem to be well hedged for this risk.
        
       | dupdup wrote:
       | For me, the definition of AGI is the tool to measure against:
       | https://arxiv.org/html/2510.18212v2
        
       | dang wrote:
       | Url changed from
       | https://www.theregister.com/2025/11/07/measuring_ai_models_h...,
       | which points to this.
        
       | gradus_ad wrote:
       | AI detractors can say whatever they want. As a developer, I find
       | Claude Code to be almost an unfair cheat code. AI valuations may
       | be absurd, but the hype is justified.
        
       ___________________________________________________________________
       (page generated 2025-11-08 23:00 UTC)