[HN Gopher] Study identifies weaknesses in how AI systems are ev...
___________________________________________________________________
Study identifies weaknesses in how AI systems are evaluated
Paper: https://openreview.net/pdf?id=mdA5lVvNcU Related:
https://www.theregister.com/2025/11/07/measuring_ai_models_h...
Author : pseudolus
Score : 264 points
Date : 2025-11-08 14:18 UTC (8 hours ago)
(HTM) web link (www.oii.ox.ac.uk)
(TXT) w3m dump (www.oii.ox.ac.uk)
| Marshferm wrote:
| Don't get high on your own supply.
| calpaterson wrote:
| Definitely one of the weaker areas in the current LLM boom.
| Comparing models, or even different versions of the same model,
| is a pseudo-scientific mess.
|
| I'm still using https://lmarena.ai/leaderboard. Perhaps there is
| something better and someone will pipe up to tell me about it.
| But we use LLMs at work and have unexplainable variations between
| them.
|
| And when we get a prompt working reliably on one model, we often
| have trouble porting it to another LLM - even straight "version
| upgrades" such as from GPT-4 to -5. Your prompt and your model
| become highly coupled quite easily.
|
| I dunno what to do about it and am tending to just pick Gemini as
| a result.
| ACCount37 wrote:
| Ratings on LMArena are too easily gamed.
|
| Even professional human evaluators are quite vulnerable to
| sycophancy and overconfident-and-wrong answers. And LMArena
| evaluators aren't professionals.
|
| A lot of the sycophancy mess that seeps from this generation of
| LLMs stems from reckless tuning based on human feedback. Tuning
| for good LMArena performance has similar effects - and not at
| all by coincidence.
| energy123 wrote:
| It's biased toward small-context performance, which is why, as
| a developer, I don't pay much attention to it beyond a quick
| glance. I need performance at 40-100k tokens, which models like
| Deepseek can't deliver but Gemini 2.5 Pro and ChatGPT 5.0
| Thinking can.
| ACCount37 wrote:
| And even "long term performance" splits itself into
| "performance on multi-turn instruction following" and
| "performance on agentic tasks" down the line. And
| "performance on agentic tasks" is a hydra in itself.
|
| Capturing LLM performance with a single metric is a
| hopeless task. But even a single flawed metric beats no
| metrics at all.
| HPsquared wrote:
| Psychometric testing of humans has a lot of difficulties, too.
| It's hard to measure some things.
| diamond559 wrote:
| I'd rather quit than be forced to beta test idiocracy. What's
| your company so we can all avoid it?
| botro wrote:
| This is something I've struggled with for my site. I made
| https://aimodelreview.com/ to compare the outputs of LLMs over
| a variety of prompts and categories, allowing a side-by-side
| comparison between them. I ran each prompt 4 times for each
| model, with different temperature values available as toggles.
|
| My thinking was to just make the responses available to users
| and let them see how models perform. But from some feedback,
| turns out users don't want to have to evaluate the answers and
| would rather see a leaderboard and rankings.
|
| The scalable solution to that would be LLM as judge that some
| benchmarks already use, but that just feels wrong to me.
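|
| If I went that route, it would be the usual judge pattern,
| something like this rough sketch (the model name and rubric are
| placeholders, not what the site actually runs):
|
|     import json
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|
|     RUBRIC = (
|         "Score the answer from 1-5 for correctness and 1-5 for "
|         'clarity. Reply with JSON: {"correctness": n, "clarity": n, '
|         '"reason": "..."}'
|     )
|
|     def judge(question, answer, judge_model="gpt-4o-mini"):
|         resp = client.chat.completions.create(
|             model=judge_model,
|             messages=[
|                 {"role": "system", "content": RUBRIC},
|                 {"role": "user", "content":
|                  f"Question:\n{question}\n\nAnswer:\n{answer}"},
|             ],
|             response_format={"type": "json_object"},
|         )
|         return json.loads(resp.choices[0].message.content)
|
|     # scores = [judge(q, a) for q, a in outputs]  # average per model
|
| The judge would share failure modes with the models it's judging,
| though, which is a big part of why it feels wrong.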
|
| LM Arena tries to solve this with a crowd-sourced solution, but
| I think the right method would have to be domain-expert human
| reviewers, so like Wirecutter vs. IMDb, but that is expensive
| to pull off.
| bubblelicious wrote:
| I work on LLM benchmarks and human evals for a living in a
| research lab (as opposed to product). I can say: it's pretty much
| the Wild West and a total disaster. No one really has a good
| solution, and researchers are also in a huge rush and don't want
| to end up making their whole job benchmarking. Even if you could,
| and even if you had the right background, you could do benchmarks
| full time and they would still be a mess.
|
| Product testing (with traditional A/B tests) is kind of the best
| bet, since you can measure what you care about _directly_ and at
| scale.
|
| I would say there is of course "benchmarketing", but generally
| people do sincerely want to make good benchmarks; it's just hard
| or impossible. For many of these problems we're hitting
| capabilities where we don't even have a decent paradigm to use.
| ACCount37 wrote:
| A/B testing is radioactive too. It's indirectly optimizing for
| user feedback - less stupid than directly optimizing for user
| feedback, but still quite dangerous.
|
| Human raters are exploitable, and you never know whether B has
| a genuine performance advantage over A, or just found a meat
| exploit by accident.
|
| It's what fucked OpenAI over with 4o, and fucked over many
| other labs in more subtle ways.
| bubblelicious wrote:
| Are you talking about just preferences or A/B tests on like
| retention and engagement? The latter I think is pretty
| reliable and powerful though I have never personally done
| them. Preferences are just as big a mess: WHO the annotators
| are matters, and if you are using preferences as a proxy for
| like correctness, you're not really measuring correctness
| you're measuring e.g. persuasion. A lot of construct validity
| challenges (which themselves are hard to even measure in
| domain).
| ACCount37 wrote:
| Yes. All of them are poisoned metrics, just in different
| ways.
|
| GPT-4o's endless sycophancy was great for retention,
| GPT-5's style of ending every response in a question is
| great for engagement.
|
| Are those desirable traits though? Doubt it. They look like
| simple tricks and reek of reward hacking - and A/B testing
| rewards them indeed. Direct optimization is even worse.
| Combining the two is ruinous.
|
| Mind, I'm not saying that those metrics are useless.
| Radioactive materials aren't useless. You just got to keep
| their unpleasant properties in mind at all times - or
| suffer the consequences.
| bjackman wrote:
| For what it's worth, I work on platforms infra at a hyperscaler
| and benchmarks are a complete fucking joke in my field too lol.
|
| Ultimately we are measuring extremely measurable things that
| have an objective ground truth. And yet:
|
| - we completely fail at statistics (the MAJORITY of analysis is
| literally just "here's the delta in the mean of these two
| samples"; if I ever do see people gesturing at actual proper
| analysis, they'll always admit when prompted: "yeah, well, we do
| come up with a p-value or a confidence interval, but we're
| pretty sure the way we calculate it is bullshit") - a rough
| sketch of even a minimal step up is below, after this list
|
| - the benchmarks are almost never predictive of the performance
| of real world workloads anyway
|
| - we can obviously always just experiment in prod but then the
| noise levels are so high that you can entirely miss million-
| dollar losses. And by the time you get prod data you've already
| invested at best several engineer-weeks of effort.
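|
| The sketch mentioned in the first bullet - even a bootstrapped
| confidence interval on the difference in means would be a step up
| from a bare delta (the numbers here are made up):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     # two benchmark runs, e.g. latency in ms (made-up samples)
|     a = np.array([101.2, 99.8, 103.4, 100.9, 102.1, 98.7, 101.5])
|     b = np.array([97.9, 99.1, 98.4, 100.2, 96.8, 99.5, 98.0])
|
|     def boot_diff_ci(a, b, n_boot=10_000, alpha=0.05):
|         # resample each run with replacement, collect mean deltas
|         diffs = np.array([
|             rng.choice(b, size=len(b)).mean()
|             - rng.choice(a, size=len(a)).mean()
|             for _ in range(n_boot)
|         ])
|         return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
|
|     lo, hi = boot_diff_ci(a, b)
|     print(f"delta of means: {b.mean() - a.mean():+.2f} ms, "
|           f"95% CI [{lo:+.2f}, {hi:+.2f}]")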
|
| AND this is a field where the economic incentives for accurate
| predictions are enormous.
|
| In AI, you are measuring weird and fuzzy stuff, and you kinda
| have an incentive to just measure some noise that looks good
| for your stock price anyway. AND then there's contamination.
|
| Looking at it this way, it would be very surprising if the
| world of LLM benchmarks was anything but a complete and utter
| shitshow!
| bofadeez wrote:
| Even a p-value is insufficient. Maybe you can use some of this
| stuff: https://web.stanford.edu/~swager/causal_inf_book.pdf
| bjackman wrote:
| I have actually been thinking of hiring some training
| contractors to come in and teach people the basics of
| applied statistical inference. I think with a bit of
| internal selling, engineers would generally be interested
| enough to show up and pay attention. And I don't think we
| need very deep expertise, just a moderate bump in the
| ambient level of statistical awareness would probably go a
| long way.
|
| It's not like there's a shortage of skills in this area; it
| seems like our one specific industry just has a weird
| blind spot.
| stogot wrote:
| Don't most computer science programs require this? Mine
| had a statistics requirement
| jiggawatts wrote:
| "Here's the throughout at sustained 100% load with the same
| ten sample queries repeated over and over."
|
| "The customers want lower latency at 30% load for unique
| queries."
|
| "Err... we can scale up for more throughput!"
|
| fragmede wrote:
| In AI though, you also have the world trying to compete with
| you. So even if you _do_ totally cheat and put the benchmark
| answers in your training set and overfit, if it turns out that
| your model sucks, it doesn't matter how much your marketing
| department tells everyone you scored 110% on SWE-bench; your
| announcement's going to flop as users discover it doesn't work
| that well on their personal/internal secret benchmarks and tell
| /r/localLLAMA it isn't worth the download.
|
| Whatever happened with Llama 4?
| bofadeez wrote:
| Has your lab tried using any of the newer causal inference-
| style evaluation methods? Things like interventional or
| counterfactual benchmarking, or causal graphs to tease apart
| real reasoning gains from data or scale effects. Wondering if
| that's something you've looked into yet, or if it's still too
| experimental for practical benchmarking work.
| j45 wrote:
| What gets measured, gets managed and improved, though.
| andy99 wrote:
| I also work in LLM evaluation. My cynical take is that nobody
| is really using LLMs for stuff, and so benchmarks are mostly
| just made-up tasks (coding is probably the exception). If we
| had real, specific use cases it would be easier to benchmark
| and know if one is better, but it's mostly all hypothetical.
|
| The more generous take is that you can't benchmark advanced
| intelligence very well, whether LLM or person. We don't have
| good procedures for assessing a person's fitness for purpose,
| e.g. for a job, certainly not standardized question sets. Why
| would we expect to be able to do this with AI?
|
| I think both of these takes are present to some extent in
| reality.
| jennyholzer wrote:
| I've been getting flagged by high-on-their-own-supply AI boosters
| for identifying that LLM benchmarks have been obvious bullshit
| for at least the last year and a half.
|
| What changed to make "the inevitable AI bubble" the dominant
| narrative in the last week or so?
| conception wrote:
| Companies are talking about needing trillions of dollars is
| why.
| foobiekr wrote:
| And the government backstops.
|
| Nothing says confidence that AGI is imminent like needing the
| US government to prevent your investments from losing you
| money.
| HPsquared wrote:
| Benchmarks in general have this problem, across pretty much all
| industries. "When a measure becomes a target" and all that.
| icameron wrote:
| The market was down, AI-related stocks especially. While only
| down a bit over 3%, it's the worst week since April, and there's
| no single event to blame; it just looks like market sentiment
| has shifted away from the previous unchecked exuberance.
| Kiro wrote:
| Link those comments please, because I checked your history and
| the flagged ones were pure nonsense with zero insights. Also,
| calling out LLM benchmarks has never been a radical take; it's
| basically the default on this site.
| purple_turtle wrote:
| It is possible to be right on the main theme but only by
| accident (with the arguments and claims being wrong), while
| communicating in a highly faulty way, with pointless insults,
| doing it in offtopic derails, being correct on a minor point
| while being mostly wrong, etc.
|
| Can you link some of these comments you consider useful but got
| flagged?
| shanev wrote:
| This is solvable at the level of an individual developer. Write
| your own benchmark for code problems that you've solved. Verify
| tests pass and that it satisfies your metrics like tok/s and
| TTFT. Create a harness that works with API keys or local models
| (if you're going that route).
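|
| A rough sketch of what such a harness can look like (assumes an
| OpenAI-compatible endpoint so local models work too; the model
| name, URL and test command are placeholders):
|
|     import subprocess, time
|     from openai import OpenAI
|
|     client = OpenAI(base_url="http://localhost:8000/v1",
|                     api_key="not-needed-locally")
|
|     def run_case(model, prompt, test_cmd):
|         t0 = time.monotonic()
|         first, parts = None, []
|         stream = client.chat.completions.create(
|             model=model,
|             messages=[{"role": "user", "content": prompt}],
|             stream=True,
|         )
|         for chunk in stream:
|             delta = chunk.choices[0].delta.content or ""
|             if delta and first is None:
|                 first = time.monotonic()
|             parts.append(delta)
|         done = time.monotonic()
|         code = "".join(parts)
|         # write where the tests expect it, then run them
|         open("candidate.py", "w").write(code)
|         passed = subprocess.run(test_cmd).returncode == 0
|         return {
|             "pass": passed,
|             "ttft_s": (first or done) - t0,
|             # crude proxy for tok/s; swap in a real tokenizer
|             "tok_s": len(code.split()) / max(done - t0, 1e-9),
|         }
|
|     # run_case("my-local-model", open("problems/001.md").read(),
|     #          ["pytest", "tests/test_001.py", "-q"])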
| motoboi wrote:
| Well, OpenAI's GitHub is open for writing evaluations. Just add
| yours there and it's guaranteed that the next model will perform
| better on them.
| cactusplant7374 wrote:
| I think that's what this site is doing:
| https://aistupidlevel.info/
| davedx wrote:
| That's called evals and yes any serious AI project uses them
| hamdingers wrote:
| At the developer level all my LLM use is in the context of
| agentic wrappers, so my benchmark is fairly trivial:
|
| Configure aider or claude code to use the new model, try to do
| some work. The benchmark is pass/fail, if after a little while
| I feel the performance is better than the last model I was
| using it's a pass, otherwise it's a fail and I go back.
|
| Building your own evaluations makes sense if you're serving an
| LLM up to customers and want to know how it performs, but if
| you are the user... use it and see how it goes. It's all
| subjective anyway.
| embedding-shape wrote:
| > Building your own evaluations makes sense if you're serving
| an LLM up to customers and want to know how it performs, but
| if you are the user... use it and see how it goes. It's all
| subjective anyway.
|
| I'd really caution against this approach, mainly because
| humans suck at removing emotions and other "human" factors
| when judging how well something works, but also because
| comparing across models gets a lot easier when you can see
| 77/100 vs 91/100 as a percentage score, over your own tasks
| that you actually use the LLMs for. Just don't share this
| benchmark publicly once you're using it for measurements.
| hamdingers wrote:
| So what? I'm the one that's using it, I happen to be a
| human, my human factor is the only one that matters.
|
| At this point, anyone using these LLMs every day has seen
| those benchmark numbers go up without an appreciable
| improvement in the day-to-day experience.
| embedding-shape wrote:
| > So what? I'm the one that's using it, I happen to be a
| human, my human factor is the only one that matters.
|
| Yeah no, you're right: if consistency isn't important to
| you as a human, then it doesn't matter. Personally, I
| don't trust my "humanness", and correctness is the most
| important thing for me when working with LLMs, so that's
| what my benchmarks focus on.
|
| > At this point anyone using these LLMs every day have
| seen those benchmark numbers go up without an appreciable
| improvement in the day to day experience.
|
| Yes, this is exactly my point. The benchmarks from the makers
| of these LLMs always seem to show better and better scores,
| yet the top scores in my own benchmarks have been more or
| less the same for the last 1.5 years, and I'm trying every
| LLM I can come across. The "best LLM to date!" hardly ever
| actually is the best available LLM, and while you could make
| that judgement by just playing around with LLMs, actually
| being able to point to specifically _why_ that is, is
| something at least I find useful. YMMV.
| j45 wrote:
| We have to keep in mind that "solving" might mean having the
| LLM recognize the pattern of solving something.
| bee_rider wrote:
| > "For example, if a benchmark reuses questions from a
| calculator-free exam such as AIME," the study says, "numbers in
| each problem will have been chosen to facilitate basic
| arithmetic. Testing only on these problems would not predict
| performance on larger numbers, where LLMs struggle."
|
| When models figure out how to exploit an effect that every clever
| college student does, that should count as a win. That's a much
| more human-like reasoning ability, than the ability to multiply
| large numbers or whatever (computers were already good at that,
| to the point that it has become a useless skill for humans to
| have). The point of these LLMs is to do things that computers
| were bad at.
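|
| For what it's worth, the effect in the quote is easy to probe for
| yourself with a rough sketch like this (the model name is just a
| placeholder):
|
|     import random
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def probe(digits, n=20, model="gpt-4o-mini"):
|         correct = 0
|         for _ in range(n):
|             a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
|             b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
|             resp = client.chat.completions.create(
|                 model=model,
|                 messages=[{"role": "user", "content":
|                            f"What is {a} * {b}? Reply with only "
|                            "the number."}],
|             )
|             guess = resp.choices[0].message.content.strip()
|             correct += guess.replace(",", "") == str(a * b)
|         return correct / n
|
|     # print(probe(2), probe(6))  # accuracy falls off with digits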
| layer8 wrote:
| I don't think the fact that LLMs can handle small numbers more
| reliably has anything to do with their reasoning ability. To
| the contrary, reasoning ability should enable them to handle
| numbers of arbitrary size, just as it enables humans to do so,
| given some pencil and paper.
|
| However:
|
| > Testing only on these problems would not predict performance
| on larger numbers, where LLMs struggle.
|
| Since performance on large numbers is not what these exams are
| intended to test for, I don't see this as a counterargument,
| unless the benchmarks are misrepresenting what is being tested
| for.
| zamadatix wrote:
| Pencil and paper is just testing with tools enabled.
| layer8 wrote:
| You seem to be addressing an argument that wasn't made.
|
| Personally, I'd say that such tool use is more akin to a
| human using a calculator.
| zamadatix wrote:
| I'm not addressing an argument, just noting that this is
| already a form of LLM testing done today, for people who
| want to look at the difference in results, the same as in
| the human analogy.
| layer8 wrote:
| Okay, but then I don't understand why you replied to my
| comment for that, there is no direct connection to what I
| wrote, nor to what bee_rider wrote.
| zamadatix wrote:
| > To the contrary, reasoning ability should enable them
| to handle numbers of arbitrary size, just as it enables
| humans to do so, given some pencil and paper.
|
| People interested can see the results of giving LLMs pen
| and paper today by looking at benchmarks with tools
| enabled. It's an addition to what you said, not an attack
| on a portion of your comment :).
| layer8 wrote:
| I see now. My focus was on the effect of LLMs' (and by
| analogy, humans') reasoning abilities argued by
| bee_rider. The fact that tool use can enable more
| reliable handling of large numbers has no bearing on
| that, hence I found the reply confusing.
| zamadatix wrote:
| Hmm, maybe it depends on the specific test and the reasoning
| in it? I certainly think reasoning about how and when to use
| allowed tools, and when not to, is a big part of the
| reasoning and verification process. E.g. most human math
| exams allow for a pen-and-paper calculation, or even a
| calculator, and that can be a great way to, say, spot-check
| a symbolic derivative and see it needs to be revisited,
| without relying on the calculator/paper to do the actual
| reasoning for the testee. Or to see that the equation of
| motion for a system can't possibly have been right with
| some test values (without which I'm not sure I'd have
| passed my mid-level physics course, haha).
|
| At the very least, the scores for benchmarking a human on
| such a test with and without tools would be different from
| comparing an LLM without the analogous constraints. Which
| is (IMO) a useful note when comparing reasoning abilities,
| and why I thought it was interesting to note that this
| kind of testing is just called testing with tools on the
| LLM side (not sure there is an equally standard term on
| the human testing side? Guess the same could be used for
| both though).
|
| At the same time I'm sure other reasoning tests don't
| gain much from/expect use of tools at all. So it wouldn't
| be relevant for those reasoning tests.
| LadyCailin wrote:
| I'd say it's fair for LLMs to be able to use any tool in
| benchmarks, so long as they are the ones to decide to use
| them.
| zamadatix wrote:
| Agreed. I don't like when the prompt sets up a good
| portion of how to go about finding the answer by saying
| which tools to use and how. The LLM needs to decide when
| and how to use them, not the prompt.
| daveguy wrote:
| I don't think it should be completely open ended. I mean,
| you could have an "ask_hooman" tool that solves a ton of
| problems with current LLMs. But that doesn't mean the LLM
| is capable with respect to the benchmark.
| vntok wrote:
| Why not? One of the most intelligent things to do when
| stuck on a problem is to get outside help.
|
| If allowing this behaviour raises a problem, you can
| always add constraints to the benchmark such as "final
| answer must come out under 15s" or something. The LLM can
| then make the decision to ask around in accordance to the
| time risk.
| daveguy wrote:
| Because AIs are good at devolving to whatever gets the
| highest score, regardless of test intent. For most problems
| "ask_hooman", or especially the plural, would be much
| more effective. So the degenerate case would dominate
| and tell you precisely zero about the intelligence of the
| AI. If a specific "tool" is more adept than the "AI", then
| "choose tool" will always be the correct answer. But I
| agree, a tight time constraint would help.
| Dylan16807 wrote:
| On some level this makes sense, but on the other hand LLMs
| already have perfect recall of thousands of symbols built
| into them, which is what pencil and paper gives to a human
| test taker.
| zamadatix wrote:
| If only context recall was actually perfect! The data is
| certainly stored well, accurately accessing the right
| part... maybe worse than a human :D.
| Dylan16807 wrote:
| If you're not doing clever hacks for very long windows, I
| thought a basic design fed in the entire window and it's
| up to the weights to use it properly.
| luke0016 wrote:
| > reasoning ability should enable them to handle numbers of
| arbitrary size, just as it enables humans to do so, given
| some pencil and paper.
|
| Or given a calculator. Which it's running on. Which it in
| some sense _is_. There's something deeply ironic about the
| fact that we have an "AI" running on the most technologically
| advanced calculator in the history of mankind and... it can't
| do basic math.
| halJordan wrote:
| This is a very unserious take. It's not ironic, because
| it's not a calculator.
| rrr_oh_man wrote:
| What's the meaning of `computer`, remind me quick?
| anamexis wrote:
| Computer vision algorithms run on computers and they
| can't do basic arithmetic.
|
| My email client runs on my computer and it doesn't do
| basic arithmetic either.
|
| Something running on a computer does not imply that it
| can or should do basic arithmetic
| TheOtherHobbes wrote:
| That's confusing basic arithmetic as a user feature with
| basic arithmetic as an implementation requirement.
|
| I guarantee that computer vision and email clients both
| use basic arithmetic in implementation. And it would be
| trivially easy to bolt a calculator into an email app,
| because the languages used to write email apps include
| math features.
|
| That's not true of LLMs. There's math at the bottom of
| the stack. But LLMs run as a separate closed and opaque
| application of a unique and self-contained type, which
| isn't easily extensible.
|
| They don't include hooks into math features on the GPUs,
| and there's no easy way to add hooks.
|
| If you want math, you need a separate tool call to
| conventional code.
|
| IMO testing LLMs as if they "should" be able to do
| arithmetic is bizarre. They can't. They're not designed
| to. And even if they did, they'd be ridiculously
| inefficient at it.
| anamexis wrote:
| Yes, you are agreeing with me.
| gishh wrote:
| Pretty sure the only thing computer vision does is math.
|
| I've also observed email clients tallying the number of
| unread emails I have. It's quite obnoxious actually, but
| I qualify adding as math.
| anamexis wrote:
| Yes, everything that a computer does, it does using math.
| This does not imply that things running on the computer
| can do basic arithmetic tasks for the user.
| ghurtado wrote:
| > Pretty sure the only thing computer vision does is
| math.
|
| That is only marginally less pedantic than saying that
| the only thing computer vision does is run discrete
| electrical signals through billions of transistors.
| benjiro wrote:
| Thing is, an LLM is nothing but a prediction algorithm based
| upon what it was trained on. So it missing basic calculator
| functionality is a given. This is why tool usage is more
| and more a thing for LLMs: so that the LLM can, on its own,
| use a calculator for the actual math parts it needs, thus
| increasing accuracy ...
| throwup238 wrote:
| Why is it a given? The universal approximation theorem
| should apply since addition is a continuous function. Now
| whether the network is sufficiently trained for that is
| another question but I don't think it's a given that a
| trillion parameter model can't approximate the most basic
| math operations.
|
| I think the tokenization is a bigger problem than the
| model itself.
| benjiro wrote:
| Easy to answer that one ... predictions are based upon
| accuracy. So if you have an int4 vs a float16, the chance
| that the prediction goes off is higher with the int4. But
| even with a float16, you're still going to run into issues
| where your prediction model goes off. It will happen a
| lot less, but you're still going to get rounding issues,
| which may result in a 5 becoming an 8 (just an example).
|
| So while it can look like an LLM calculates correctly, it's
| still restricted by this accuracy issue. And when you get a
| single number wrong in a calculation, everything is wrong.
|
| A calculator, meanwhile, does not deal with predictions but
| with basic adding/multiplying/subtracting etc., things that
| are 100% accurate (if we do not count issues like cosmic
| rays hitting, failures in silicon, etc.).
|
| A trillion-parameter model is just that, a trillion
| parameters, but what matters is not the tokens but the
| accuracy, as in: do they use int, float16, float32,
| float64 ... The issue is, the higher we go, the more the
| memory usage explodes.
|
| There is no point in spending terabytes of memory just to
| get a somewhat accurate predictive calculator, when we can
| just have the LLM call an actual calculator, to ensure its
| results are accurate.
|
| Think of an LLM more like somebody with dyslexia /
| dyscalculia... It does not matter how good you are; all
| it takes is switching one number in an algebraic
| calculation to get a 0/10 ... The reason why I mention
| this is because I often think of an LLM like a person
| with dyslexia / dyscalculia. It can have insane
| knowledge, be smart, but be considered dumb by society
| because of that less-than-accurate prediction (or number
| swapping issue).
|
| Take it from somebody that wasted a few years in school
| thanks to that issue: it really does not matter if you're
| a good programmer later in life, when you flunk a few
| years thanks to undiagnosed issues. And yet, just like an
| LLM, I simply rely on tool usage to fix my inaccuracy
| issues. No point in wasting good shoulder space trying to
| graft a dozen more heads/brains onto me, when I can simply
| delegate the issue away. ;)
|
| The fact that we can get computer models that can almost
| program, write texts, ... and do so much more, like a
| slightly malfunctioning human, amazes me. And at the same
| time, I curse at it like my teachers did, and also call
| it dumb at times hehehe ... I now understand how my
| teachers felt _loool_
| DrewADesign wrote:
| If they were selling LLMs as "LLMs" instead of magic
| code-writing, answer-giving PhD replacements, the lack of
| basic arithmetic capability _would_ be a _given_ ... but
| they _aren't._ Judging a paid service using their own
| implied claims is perfectly reasonable.
| novok wrote:
| This is like saying it's ironic that an alternator in a car
| cannot combust gasoline when the gasoline engine is right
| beside it, even though the alternator 'runs' on the
| gasoline engine.
| luke0016 wrote:
| Or similarly having a gasoline engine without an
| alternator and making the observation that there's an
| absurdity there in that you're generating large amounts
| of energy, yet aren't able to charge a relatively small
| 12V battery with any of it. It's a very practical and
| natural limitation, yet in some sense you have exactly
| what you want - energy - you just can't use it because of
| the form. If you step back there's an amusing irony
| buried in that. At least in my humble opinion :-)
| ambicapter wrote:
| > Since performance on large numbers is not what these exams
| are intended to test for,
|
| How so? Isn't the point of these exams to test arithmetic
| skills? I would hope we'd like arithmetic skills to be at a
| constant level regardless of the size of the number?
| singron wrote:
| No. AIME is a test for advanced high schoolers that mostly
| tests higher level math concepts like algebra and
| combinatorics. The arithmetic required is basic. All the
| answers are 3-digit numbers so that judging is objective
| and automated while making guessing infeasible. You have 12
| minutes on average for each question, so even if you are
| terribly slow at arithmetic, you should still be able to
| calculate the correct answer if you can perform all the
| other math.
| ambicapter wrote:
| That's probably a great test for high schoolers but it
| doesn't really test what we want from AI, no? I would
| expect AI to be limited by the far greater constraints of
| its computing ability, and not the working memory of a
| human high schooler.
| gardnr wrote:
| A discussion on models "figuring out" things:
| https://www.youtube.com/watch?v=Xx4Tpsk_fnM (Forbidden
| Technique)
| BolexNOLA wrote:
| >the point of these LLMs is to do things that computers were
| bad at.
|
| The way they're being deployed it feels like the point of LLMs
| is largely to replace basic online search or to run your online
| customer support cheaply.
|
| I'm a bit out on a limb here because this is not really my
| technical expertise by any stretch of the imagination, but it
| seems to me these benchmark tests don't really tell us much
| about how LLMs perform in the ways most people actually use
| them. Maybe I'm off base here, though.
| riskable wrote:
| Nobody really knows "the point" of LLMs yet. They weren't
| even "invented" as much as they _emerged_ as a trick to get
| computers to better understand human language.
|
| They're still brand spanking new and everyone's trying to
| figure out how to best use them. We don't even really know if
| they're _ever_ going to be "really good at" any given task!
|
| Are they "really good at" these things or are they merely
| "OK-ish"? * Answering factual questions.
| * Programming. * Understanding what the user wants
| from natural language. * Searching/recommending
| stuff.
|
| Real world testing suggests that with billions and billions
| of dollars spent, you really _can_ get an LLM to be "OK-ish"
| at all those things :D
| nradov wrote:
| LLMs can probably be taught or configured to use external tools
| like Excel or Mathematica when such calculations are needed.
| Just like humans. There are plenty of untapped optimization
| opportunities.
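|
| Standard function/tool calling already covers this; a rough
| sketch of the pattern (the model name is a placeholder, and a
| real setup would use a sandboxed evaluator instead of eval):
|
|     import json
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     tools = [{
|         "type": "function",
|         "function": {
|             "name": "calculate",
|             "description": "Evaluate an arithmetic expression.",
|             "parameters": {
|                 "type": "object",
|                 "properties": {"expression": {"type": "string"}},
|                 "required": ["expression"],
|             },
|         },
|     }]
|
|     messages = [{"role": "user", "content": "What is 1765 * 9392?"}]
|     resp = client.chat.completions.create(
|         model="gpt-4o-mini", messages=messages, tools=tools)
|     # assumes the model chose to call the tool
|     call = resp.choices[0].message.tool_calls[0]
|     expr = json.loads(call.function.arguments)["expression"]
|     result = eval(expr, {"__builtins__": {}})  # toy only; sandbox it
|     messages += [resp.choices[0].message,
|                  {"role": "tool", "tool_call_id": call.id,
|                   "content": str(result)}]
|     final = client.chat.completions.create(
|         model="gpt-4o-mini", messages=messages, tools=tools)
|     print(final.choices[0].message.content)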
| davedx wrote:
| Chatgpt spins up a python sandbox for any complex
| calculations. It's been able to do that for a while now
| 6510 wrote:
| I don't claim to know anything, but I thought tool usage was a
| major sign of intelligence. For example, floats are a wonderful
| technology, but people use them as if chainsaws are great for
| cutting bread and butter. We now have entire languages that
| can't do basic arithmetic. I thought it was alarming: _People,
| it can't compute like this!_ Now we have language models, those
| are still computers, so why can't we just give them.. you
| know... calculators? Arguably the best thing their universe has
| to offer.
|
| edit: I forgot my point: calculating big numbers is not a real-
| world problem anyone has.
| yunyu wrote:
| We do? Tool use started coming into vogue around 2023
| riskable wrote:
| Actually, tool use started coming into vogue around 3.3
| million years ago.
| novok wrote:
| IMO I think the calculator problem goes away with tool use, or
| NN architectures that basically add a calculator equivalent as
| one of the potential 'experts' or similar. It won't be much of
| a trope for much longer.
| davedx wrote:
| Chatgpt has been calculating things in its python sandbox for
| years already. This is a trope indeed
| jvanderbot wrote:
| Absolutely not.
|
| College exam takers use those tricks because they are on a time
| limit and are gaming the system. It's clever and wink wink
| nudge nudge ok everyone does it. But it's one tiny signal in a
| huge spectrum of things we use to evaluate _people_.
|
| Instead, these metrics are gamed and presented as the _entire_
| multi-spectral signal of competence for LLMs, because it is
| literally impossible to say that success in one domain would
| translate the way it might with a good hire.
|
| What I want is something I don't have to guard against gaming.
| Something conscientious and capable like my co-workers. Until
| then it's Google version 2 married to IntelliSense, and I'm not
| letting it do anything by itself.
| joe_the_user wrote:
| _The point of these LLMs is to do things that computers were
| bad at._
|
| That's a good point imo but we achieved this stuff by at least
| 2022 when ChatGPT was released. The thing about these giant
| black boxes is that they also fail to do things that directly
| human-written software ("computers") does easily. The inability
| to print text onto generated images or do general arithmetic is
| important. And sure, some of these limits look like "limits of
| humans". But it is important to avoid jumping from "they do
| this human-thing" to "they're like humans".
| moritzwarhier wrote:
| When people claim that there is such a thing as "X% accuracy in
| reasoning", it's really hard to take anything else seriously, no
| matter how impressive.
|
| AI (and humans!) aside, claiming that there was an oracle that
| could "answer all questions" is a solved problem. Such a thing
| cannot exist.
|
| But this is going already too deep IMO.
|
| When people start talking about percentages or benchmark scores,
| there has to be some denominator.
|
| And there can be no bias-free such denominator for
|
| - trivia questions
|
| - mathematical questions (oh, maybe I'm wrong here, intuitively
| I'd say it's impossible for various reasons: varying "hardness",
| undecidable problems etc)
|
| - historical or political questions
|
| I wanted to include "software development tasks", but it would be
| a distraction. Maybe there will be a good benchmark for this, I'm
| aware there are plenty already. Maybe AI will be capable to be a
| better software developer than me in some capacity, so I don't
| want to include this part here. That also maps pretty well to
| "the better the problem description, the better the output",
| which doesn't seem to work so neatly with the other categories of
| tasks and questions.
|
| Even if the whole body of questions/tasks/prompts were very
| constrained and covered only a single domain, it seems impossible
| to guarantee that such a benchmark is "bias-free" (I know AGI
| folks love this word).
|
| Maybe in some interesting special cases? For example, very
| constrained and clearly defined classes of questions, at which
| point, the "language" part of LLMs seems to become less important
| and more of a distraction. Sure, AI is not just LLMs, and LLMs
| are not just assistants, and Neural Networks are not just LLMs...
|
| That's where the problem begins, to be honest: I don't even know
| how to align the "benchmark" claims with the kinds of AI they are
| examining and the ones I know exist.
|
| Sure it's possible to benchmark how well an AI decides whether,
| for example, a picture shows a rabbit. Even then: for some
| pictures, it's gotta be undecidable, no matter how good the
| training data is?
|
| I'm just a complete layman and commenting about this; I'm not
| even fluent in the absolute basics of artificial neural networks
| like perceptrons, gradient descent, backpropagation and typical
| non-LLM CNNs that are used today, GANs etc.
|
| I am and was impressed by AI and deep learning, but to this day I
| am thoroughly disappointed by the hubris of snake-oil salespeople
| who think it's valuable and meaningful to "benchmark" machines on
| "general reasoning".
|
| I mean, it's already a thing in humans. There are IQ tests for
| the non-trivia parts. And even these have plenty of discussion
| revolving around them, for good reason.
|
| Is there some "AI benchmark" that exclusively focuses on doing
| recent IQ tests on models, preferably editions that were
| published after the particular knowledge cutoff of the respective
| models? I found (for example) this study [1], but to be honest,
| I'm not the kind of person who is able to get the core insights
| presented in such a paper by skimming through it.
|
| Because I think there _are_ impressive results, it's just
| becoming very hard to see through the bullshit as an average
| person.
|
| I would also love to understand more about the current state of
| the research on the "LLMs as compression" topic [2][3].
|
| [1] https://arxiv.org/pdf/2507.20208
|
| [2] https://www.mattmahoney.net/dc/text.html
|
| [3] https://arxiv.org/abs/2410.21352
| SurceBeats wrote:
| Benchmarks optimize for fundraising, not users. The gap between
| "state of the art" and "previous gen" keeps shrinking in real-
| world use, but investors still write checks based on decimal
| points in test scores.
| yawnxyz wrote:
| we try to make benchmarks for users, but it's like that 20%
| article - different people want different 20% and you just end
| up adding "features" and whackamoling the different kinds of
| 20%
|
| if a single benchmark could be a universal truth, and it was
| easy to figure out how to do it, everyone would love that.. but
| that's why we're in the state we're in right now
| DrewADesign wrote:
| The problem isn't with the benchmarks (or the models, for
| that matter) it's their being used to prop up the
| indefensible _product_ marketing claims made by people
| frantically justifying asking for more dump trucks of
| thousand-dollar bills to replace the ones they just burned
| through in a few months.
| RA_Fisher wrote:
| For statistical AI models, we can use out of sample prediction
| error as an objective measure to compare models. What makes
| evaluating LLMs difficult is that comparisons are inextricable
| from utility (whereas statistical AI models do have a pre-utility
| step wherein it can be shown out of sample prediction epsilon is
| minimized).
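|
| In other words, the usual held-out loop; a toy sketch of the kind
| of comparison that is possible there:
|
|     import numpy as np
|     from sklearn.linear_model import LinearRegression
|     from sklearn.ensemble import RandomForestRegressor
|     from sklearn.model_selection import train_test_split
|     from sklearn.metrics import mean_squared_error
|
|     rng = np.random.default_rng(0)
|     X = rng.normal(size=(500, 5))
|     y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0])
|          + rng.normal(scale=0.3, size=500))
|
|     X_tr, X_te, y_tr, y_te = train_test_split(
|         X, y, test_size=0.3, random_state=0)
|
|     for model in (LinearRegression(),
|                   RandomForestRegressor(random_state=0)):
|         model.fit(X_tr, y_tr)
|         err = mean_squared_error(y_te, model.predict(X_te))
|         print(type(model).__name__, f"out-of-sample MSE: {err:.3f}")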
| SkyPuncher wrote:
| Benchmarks are nothing more than highly contextual specs (in
| traditional code). They demonstrate your code works in a certain
| way in certain use cases, but they do not prove your code works
| as expected in all use cases.
| embedding-shape wrote:
| > Program testing can be used to show the presence of bugs, but
| never to show their absence. - Edsger W. Dijkstra
|
| Maybe we need something similar for benchmarks, and updated for
| today's LLMs, like:
|
| > LLM benchmarks can be used to show what tasks they can do,
| but never to show what tasks they cannot.
| pahae wrote:
| I wish the big providers would offer some sort of trial period
| where you can evaluate models in a _realistic_ setting yourself
| (i.e. CLI tools or IDE integrations). I wouldn't even mind strict
| limits -- just give me two hours or so of usage and I'd already
| be happy. Seriously.
|
| My use-case is probably pretty far from the usual tasks: I'm
| currently implementing a full observability platform based on
| VictoriaMetrics / Victorialogs + Grafana. It's quite elaborate
| and has practically no overlap with the usual/cloud solutions you
| find out there. For example, it uses an authenticated query
| stack: I use the Grafana oauth token to authenticate queries by
| injecting matchers via prom-label-proxy and forward that to
| promxy for fan-out to different datasources (using the label
| filter to only query some datasources). The IaC stuff is also not
| mainstream as I'm not using any of the big cloud providers, but
| the provider I use nonetheless has a terraform provider.
|
| As you can imagine, there's probably not much training data for
| most of this, so quality of the responses varies widely. From my
| experience so far, Claude (Sonnet 4.5) does a _much_ better job
| than GPT-5 (Codex or normal) with the day-to-day tasks. Stuff like
| keeping documentation up to date, spotting inconsistencies,
| helping me find blind spots in the Alerting rules, etc. It also
| seems to do better working with provided documentation / links.
|
| I've been using Claude for a couple of weeks now but recently
| switched to codex after my subscription to Claude ran out. I was
| really curious after reading a lot of good things about it but I
| gotta say, so far, I'm not impressed. Compared to Claude it gives
| wrong answers much more frequently (at least in this domain). The
| results it produces take much more effort to clean up than
| Claude's. Probably on a level where I could just invest the time
| myself. Might be that I do not yet know how to correctly prompt
| GPT but giving both tools the same prompt, Claude does a better
| job 90% of the time.
|
| Anyway, I guess this is my long-winded way of saying that the
| quality of responses "off the beaten track" varies widely and is
| worth testing several models with. Especially if your work is not
| 70+% coding. Even then, I guess that many benchmarks have
| ceased being useful by now?
| cavisne wrote:
| https://wuu73.org/blog/aiguide1.html
|
| You can get a lot of free usage out of the models.
| tim333 wrote:
| There's the github copilot 30 day trial? "Access to Anthropic
| Claude Sonnet 4, GPT-5, Gemini 2.5 Pro, and more 300 premium
| requests to use the latest models and code review"
| bbor wrote:
| I'm already quite put off by the title (it's science -- if you
| have a better benchmark, publish it!), but the contents aren't
| great either. It keeps citing numbers about "445 LLM benchmarks"
| without confirming whether any of the ones they deem
| insufficiently statistical are used by any of the major players.
| I've seen a lot of benchmarks, but _maybe_ 20 are used regularly
| by large labs, max.
|
| "For example, if a benchmark reuses questions from a
| calculator-free exam such as AIME," the study says, "numbers in
| each problem will have been chosen to facilitate basic
| arithmetic. Testing only on these problems would not predict
| performance on larger numbers, where LLMs struggle."
|
| For a math-based critique, this seems to ignore a glaring
| problem: is it even _possible_ to randomly sample all natural
| numbers? As another comment pointed out, we wouldn't even want to
| ("LLMs can't accurately multiply 6-digit numbers" isn't something
| anyone cares about/expected them to do in the first place), but
| regardless: this seems like a vacuous critique dressed up in a
| costume of mathematical rigor.
|
| At least some of those who design benchmark tests are aware of
| these concerns.
|
| In related news, at least some scientists studying climate change
| are aware that their methods are imperfect. More at 11!
|
| If anyone doubts my concerns and thinks this article is in good
| faith, just check out this site's "AI+ML" section:
| https://www.theregister.com/software/ai_ml/
| daveguy wrote:
| The article references this review:
|
| https://openreview.net/pdf?id=mdA5lVvNcU
|
| And the review is pretty damning regarding statistical validity
| of LLM benchmarks.
| dang wrote:
| (We've since changed both title and URL - see
| https://news.ycombinator.com/item?id=45860056)
| wolttam wrote:
| I'd like to see some video generation benchmarks. For example,
| one that tested a model's ability to generate POV footage of a
| humanoid form carrying out typical household tasks
|
| Even if it requires human evaluators at first, and even if the
| models _completely suck_ at this task right now: it seems like
| the kind of task you 'd want them to be good at, if you want
| these models to eventually carry out these tasks in embodied
| forms in the real world.
|
| Just having the benchmark in the first place is what gives model
| makers something to optimize for.
| luckydata wrote:
| Generating footage wouldn't help with the opposite, but
| navigating a simulation would, and that is a pretty standard type
| of evaluation for multimodal AIs designed to act in the real
| world.
| wolttam wrote:
| Do you mean that it wouldn't help with ingesting footage and
| then determining how to act?
|
| I can imagine a robotics architecture where you have one
| model generating footage (next frames for what it is
| currently seeing) and another dumber model which takes in the
| generated footage and only knows how to generate the
| motor/servo control outputs needed to control whatever robot
| platform it is integrated with.
|
| I think that kind of architecture decoupling would be nice.
| It allows the model with all the world and task-specific
| knowledge to be agnostic from its underlying robot platform.
| lysace wrote:
| Tech companies/bloggers/press/etc are perpetually bad at
| benchmarks. For browsers they kept pushing simplistic javascript-
| centric benchmarks even when it was clear for at least 15 years
| that layout/paint/network/etc were the dominant bottlenecks in
| real-world usage.
|
| It's primarily marketing-driven. I think the technical parts of
| companies need to attempt to own this more.
|
| It gets really weird when engineering priorities shift because of
| these mostly irrelevant benchmarks.
| doctorpangloss wrote:
| The problem with the LLM benchmarks is that if you see one that
| shows high performance by something that isn't from Anthropic,
| Google or OpenAI, you don't believe it, even if it were "true."
| In that sense, benchmarks are a holistic social experience in
| this domain, less a scientific endeavour.
| qustrolabe wrote:
| Technically true but also a very dumb take and manipulative
| phrasing
| riskable wrote:
| We should make a collective git repo full of every kind of
| annoying bug we (expert developers) can think of. Then use _that_
| to benchmark LLMs.
|
| Someone want to start? I've got a Yjs/CRDT collaborative editing
| bug that took like a week and a half of many, many attempts with
| Claude Code (Sonnet 4.5), GPT5-codex (medium), and GLM-4.6 to
| figure out. Even _then_ they didn't really get it... Just came up
| with a successful workaround (which is good enough for me but
| still...).
|
| Aside: You know what _really_ moved the progress bar on finding
| and fixing the bug? When I had a moment of inspiration and made
| the frontend send all its logs to the backend so the AIs could
| see what was actually happening on the frontend (near real-time).
| Really, I was just getting sick of manual testing and pasting the
| console output into the chat (LOL). Laziness FTW!
|
| I have the Google Chrome Dev Tools MCP but for some reason it
| doesn't work as well :shrug:
| simonw wrote:
| Have you tried the Playwright libraries? Not the MCP, instead
| telling Claude Code to use the Node.js or Python Playwright
| libraries directly. I have had some really good results for
| this for gnarly frontend challenges.
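|
| Even something this small is usually enough for the agent to see
| what the frontend is doing (the URL and selector here are
| placeholders):
|
|     from playwright.sync_api import sync_playwright
|
|     with sync_playwright() as p:
|         browser = p.chromium.launch()
|         page = browser.new_page()
|         page.on("console",
|                 lambda msg: print(f"[console.{msg.type}] {msg.text}"))
|         page.on("pageerror", lambda err: print(f"[pageerror] {err}"))
|         page.goto("http://localhost:3000")
|         page.click("text=Save")      # reproduce the buggy interaction
|         page.wait_for_timeout(1000)  # let async work log something
|         browser.close()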
| kvirani wrote:
| Curious why not the MCP? I use that
| s900mhz wrote:
| When I have a bug I'm iterating on it's much easier and
| faster to have it write out the playwright script. That way
| it does not have to waste time or tokens performing the
| same actions over and over again.
|
| Think of it as TDD.
| simonw wrote:
| I don't really like MCPs, at least when I'm working with
| coding agents like Claude Code or Codex CLI. I'd rather let
| the agents write code that can do anything the underlying
| library is capable of, rather than restricting them to just
| the functionality that the MCP exposes.
|
| It's more token efficient too since I don't need to load
| the full MCP description into my context.
| cortesoft wrote:
| It would be pretty easy to overfit the results with a static
| set of tests
| bogwog wrote:
| I can't tell how much of this is sarcasm
|
| > we (expert developers) ...
|
| > took like a week and a half of attempts with Claude Code ...
|
| What kind of expert developer wastes that much time prompting a
| bunch of different LLMs to end up with a workaround, instead of
| actually debugging and fixing the bug themselves?
| kvirani wrote:
| Fair question but I think the tone of this is a bit abrasive
| towards the poster, and unnecessarily so.
| topato wrote:
| there is a lot of disdain for vibe coding/coders, as I'm
| sure you already know. I was going to post something
| similar as soon as I read "a week and a half" of prompts. I
| pray that any gainfully employed expert coders don't spend
| 10 days prompting rather than coding lol
| ambicapter wrote:
| I really don't think so. "Expert" developer really needs to
| mean something other than "prompting and poking at Claude
| Code".
| atn34 wrote:
| I actually started a collection of annoying bugs I've seen in
| the wild. I give the llm the buggy implementation and ask it to
| write a test that catches it. So far not even a frontier model
| (Claude Sonnet) can do it, even though they can find and fix
| the bug itself.
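|
| The check itself is mechanical: the generated test has to fail on
| the buggy version and pass on the fixed one (rough sketch, paths
| are placeholders):
|
|     import subprocess
|
|     def pytest_passes(repo_dir, test_file):
|         return subprocess.run(["pytest", test_file, "-q"],
|                               cwd=repo_dir).returncode == 0
|
|     def catches_bug(buggy_dir, fixed_dir, test_file):
|         return (not pytest_passes(buggy_dir, test_file)
|                 and pytest_passes(fixed_dir, test_file))
|
|     # catches_bug("bugs/001/buggy", "bugs/001/fixed",
|     #             "test_llm_generated.py")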
| embedding-shape wrote:
| > even a frontier model (Claude Sonnet) can do it
|
| Probably because Sonnet is no longer a frontier model, it
| isn't even the best model Anthropic offers, according to
| themselves.
| embedding-shape wrote:
| > We should make a collective git repo full of every kind of
| annoying bug we (expert developers) can think of. Then use that
| to benchmark LLMs.
|
| I think any LLM user worth their salt has been doing this
| pretty much since we got API access to LLMs, as otherwise there
| is no way to actually see if they can solve the things you care
| about.
|
| The only difference is that you must keep the actual benchmarks
| to yourself, don't share them with anyone and even less put
| them publicly. The second you do, you probably should stop
| using it as an actual benchmark, as newly trained LLMs will
| either intentionally or unintentionally slurp up your benchmark
| and suddenly it's no longer a good indicator.
|
| I think I personally started keeping my own test cases for
| benchmarking around the GPT-3 launch, when it became clear the
| web would be effectively "poisoned" from that point on, and that
| anything on the public internet can be slurped up by the people
| feeding the LLMs training data.
|
| Once you have this up and running, you'll get a much more
| measured view of how well new LLMs work, and you'll quickly see
| that a lot of the fanfare doesn't actually hold up when testing
| it against your own private benchmarks. On a happier note,
| you'll also be surprised when a model suddenly does a lot
| better in a specific area that wasn't even mentioned at
| release, and then you could switch to it for specifically that
| task :)
| throw10920 wrote:
| This may be intentional, but I'd like to point out that you're
| basically suggesting that others aggregate high-quality
| training data for AI companies to use free of charge to replace
| software engineers.
| n_u wrote:
| What was the CRDT bug?
| jstummbillig wrote:
| Benchmarks are like SAT scores. Can they guarantee you'll be
| great at your future job? No, but we are still roughly okay with
| what they signify. Clearly LLMs are getting better in meaningful
| ways, and benchmarks correlate with that to some extent.
| zeroonetwothree wrote:
| There's no a priori reason to expect a test designed to test
| human academic performance would be a good one to test LLM job
| performance.
|
| For example a test of "multiply 1765x9392" would have some
| correlation with human intelligence but it wouldn't make sense
| to apply it to computers.
| sroussey wrote:
| Actually... ask gpt1 to multiply 1765x9392.
| SV_BubbleTime wrote:
| I wish this was more broadly explained to people...
|
| There are LLMs, the engines that make these products run,
| and then the products themselves.
|
| GPT anything should not be asked math problems. LLMs are
| language models, not math engines.
|
| The line is going to get very blurry, because ChatGPT, Claude,
| and Gemini are not LLMs. They're products driven by LLMs.
|
| The question or requirement should not be "can my LLM do
| math?" It's "can I build a product that is LLM-driven that can
| reason through math problems?" Those are different things.
|
| A coworker of mine told me that GPT's LLM can use Excel
| files. No, it can't. But the tools they plugged into it
| can.
| SV_BubbleTime wrote:
| Isn't this like grading art critics?
|
| We took objective computers, and made them generate subjective
| results. Isn't this a problem that we already know there's no
| solution to?
|
| That grading subjectivity is just subjective itself.
| pessimizer wrote:
| People often use "clearly" or "obviously" to elide the subject
| that is under discussion. People are saying that they do not
| think that it is clear that LLMs are getting better in
| meaningful ways, and they are saying that the benchmarks are
| problematic. "Clearly" isn't a counterargument.
| AbrahamParangi wrote:
| A test doesn't need to be objectively meaningful or rigorous in
| any sense in order to still be useful for comparative ranking.
| hobs wrote:
| yes it does - it has to be meaningful or rigorous for the
| comparative ranking to be meaningful or rigorous, or else wtf
| are you doing? Say I have all the information on my side but
| only these questions that you are showing the user? Who cares
| about that comparison?
| JumpCrisscross wrote:
| Yeah, "I ordered all the cars in the world from least to most
| blue" produces a comparative ranking. It's just not a useful
| one.
| instagraham wrote:
| I've written about Humanity's Last Exam, which crowdsources tough
| questions for AI models from domain experts around the world.
|
| https://www.happiesthealth.com/articles/future-of-health/hum...
|
| It's a shifting goalpost, but one of the things that struck me
| was how some questions could still be trivial for a fairly
| qualified human (a doctor in this case) but difficult for an AI
| model. Reasoning, visual or logic, is built on a set of
| assumptions that are better gained through IRL experience than
| crawling datasets and matching answers.
|
| This leads me to believe that much of the future for training AI
| models will lie in exposing them to "meatspace" and annotating
| their inferences, much like how we train a child. This is a long,
| long process, and one that is already underway at scale. But it's
| what might give us emergent intelligences rather than just a
| basket of competing yet somehow-magic thesauruses.
| sroussey wrote:
| Mercor is doing nine-digit per-year revenue doing just that.
| Micro1 and others also.
| zeroonetwothree wrote:
| Humans are much better at out of sample prediction than LLMs. And
| inherently benchmarks cannot be out of sample. So I believe that
| leads to the disconnect between LLMs getting better and better at
| in sample prediction (benchmarks) while not improving nearly as
| much at out of sample (actual work).
| twilightzone wrote:
| "Measuring money turns out to be easier than measuring
| intelligence." Don't ever change, El Reg.
| dehrmann wrote:
| This might explain the zeitgeist that new models feel same-ish,
| despite model developers saying they're getting spectacularly
| better.
| inavida wrote:
| They should laugh while they can ;) Still waiting for the crash
| and to see what lives on and what gets recycled. My bet is that
| grok is here to stay ;)
|
| (Don't hurt me, I just like his chatbot. It's the best I've tried
| at, "Find the passage in X that reminded me of the passage in Y
| given this that and the other thing." It has a tendency to blow
| smoke if you let it, but they all seek to affirm more than I'd
| like, but ain't that the modern world? It can also be hilariously
| funny in surprisingly apt ways.)
| typpilol wrote:
| Grok is terrible at coding though.
| JumpCrisscross wrote:
| If models get commoditised, distribution (and vertical
| integration) become key. OpenAI and xAI are the only
| companies that seem to be well hedged for this risk.
| dupdup wrote:
| for me the definition of AGI is the tool to measure
| https://arxiv.org/html/2510.18212v2
| dang wrote:
| Url changed from
| https://www.theregister.com/2025/11/07/measuring_ai_models_h...,
| which points to this.
| gradus_ad wrote:
| AI detractors can say whatever. As a developer Claude Code is
| almost an unfair cheat code. AI valuations may be absurd but the
| hype is justified.
___________________________________________________________________
(page generated 2025-11-08 23:00 UTC)