[HN Gopher] The Leaderboard Illusion
___________________________________________________________________
The Leaderboard Illusion
Author : pongogogo
Score : 139 points
Date : 2025-04-30 07:58 UTC (15 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| pongogogo wrote:
| I think this is a really interesting paper from Cohere. It really
| feels that, at this point in time, you can't trust any public
| benchmark, and you really need your own private evals.
| ilrwbwrkhv wrote:
| Yup, in my private evals I have repeatedly found that DeepSeek
| has the best models for everything, and yet in a lot of these
| public ones it always seems like someone else is at the top. I
| don't know why.
| __alexs wrote:
| Publishing them might help you find out.
| refulgentis wrote:
| ^ This.
|
| If I had to hazard a guess, as a poor soul doomed to
| maintain several closed and open source models acting
| agentically: I think you are hyper-focused on chat trivia
| use cases. (DeepSeek has a very, very hard time with tool
| calling, and they say as much themselves in their API docs.)
| AstroBen wrote:
| Any tips on coming up with good private evals?
| pongogogo wrote:
| Yes, I wrote something up here on how Andrej Karpathy
| evaluated Grok 3 ->
| https://tomhipwell.co/blog/karpathy_s_vibes_check/
|
| I would pick the one or two parts of that analysis that are most
| relevant to you and zoom in. I'd choose something difficult
| that the model fails at, then look carefully at how the model's
| failures change as you test different model generations.
| aredox wrote:
| The fact that these big LLM developers devote a significant
| amount of effort to gaming benchmarks is a big show of confidence
| that they are making progress towards AGI and will recoup those
| billions of dollars and man-hours /s
| amelius wrote:
| Are the benchmark prompts public and isn't that where the
| problem lies?
| StevenWaterman wrote:
| No, even if the benchmarks are private, it's still an issue,
| because you can overfit to the benchmark by trying X random
| variations of the model and picking the one that performs
| best on the benchmark.
|
| It's similar to how I can pass any multiple-choice exam if
| you let me keep attempting it and tell me my overall score at
| the end of each attempt - even if you don't tell me which
| answers were right/wrong.
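|
| A minimal simulation of that selection effect (hypothetical
| numbers; every variant has identical true quality, so any gap
| between the average and the "winner" is pure selection noise):
|
|     import random
|
|     random.seed(0)
|     TRUE_SKILL = 0.70      # each variant answers correctly 70% of the time
|     BENCHMARK_SIZE = 200   # questions in the (private) benchmark
|     N_VARIANTS = 50        # random variants we get to try
|
|     def benchmark_score():
|         # Fraction of benchmark questions one variant gets right.
|         hits = sum(random.random() < TRUE_SKILL
|                    for _ in range(BENCHMARK_SIZE))
|         return hits / BENCHMARK_SIZE
|
|     scores = [benchmark_score() for _ in range(N_VARIANTS)]
|     print("average variant:", round(sum(scores) / len(scores), 3))  # ~0.70
|     print("best-of-N pick: ", round(max(scores), 3))  # noticeably higher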
| amelius wrote:
| Maybe there should be some rate limiting on it then? I.e.,
| once a month you can benchmark your model. Of course you
| can submit under different names, but how many company
| names can someone realistically come up with and register?
| sebastiennight wrote:
| So now you want OpenAI to go _even wilder_ in how they
| name each new model?
| amelius wrote:
| 1 model per company per month, max.
| VladVladikoff wrote:
| Now I'm wondering what the most efficient algorithm is to
| obtain a mark of 100% in the least number of attempts.
| Guessing one question per attempt seems inefficient.
| Perhaps guessing the whole exam as option A, then
| submitting the whole exam as option B, and so on, at the
| start, could give you a count of how many As are correct.
| Then maybe some sort of binary search through the rest of the
| options? You could submit the first 1/2 as A and the second
| 1/2 as B. Etc. hmmmm
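|
| For intuition, a rough sketch (a hypothetical setup, and nowhere
| near optimal) of how score-only feedback already leaks the full
| answer key if you change one question at a time:
|
|     import random
|
|     CHOICES, N = "ABCD", 20
|     key = [random.choice(CHOICES) for _ in range(N)]  # hidden answer key
|     attempts = 0
|
|     def grade(submission):
|         # The only feedback allowed: the overall score.
|         global attempts
|         attempts += 1
|         return sum(s == a for s, a in zip(submission, key))
|
|     guess = ["A"] * N            # baseline: answer A everywhere
|     baseline = grade(guess)
|
|     for i in range(N):
|         for choice in "BCD":
|             trial = list(guess)
|             trial[i] = choice
|             score = grade(trial)
|             if score > baseline:   # found the right answer for question i
|                 guess, baseline = trial, score
|                 break
|             if score < baseline:   # the score dropped, so "A" was correct
|                 break
|
|     print(guess == key, "recovered in", attempts, "graded attempts")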
| amelius wrote:
| Maybe an llm can tell you how to best approach this
| problem ;)
| leto_ii wrote:
| Is this sarcasm? Otherwise I'm not sure how that follows. Seems
| more reasonable to believe that they're hitting walls and
| switching to PR and productizing.
| RodgerTheGreat wrote:
| Ending a paragraph with "/s" is a moderately common
| convention for conveying a sarcastic tone through text.
| Terr_ wrote:
| I believe they _are_ being sarcastic, but Poe's Law is in
| play and it's too ambiguous for practical purposes.
| unkulunkulu wrote:
| Sounds like the classic inequality observed everywhere: success
| leads to attention, which leads to more success.
|
| Why spend evaluation resources on outsiders? Everyone wants to
| know exactly who is first, second, etc.; after #10 it's "do your
| own evaluation if this is important to you."
|
| Thus, we have this inequality.
| boxed wrote:
| Is it? Sounds to me like they run the same experiment many
| times and keep the "best" results, which is cheating - or, if
| the same thing were done in biomedical research, research fraud.
| sumtechguy wrote:
| Back in the Slashdot days I would experiment with steering
| conversations. This was due to the way SD would rank and show
| its posts. Anything below a 3 would not change anything. But
| if you could get in early AND get a +5 on your post, you could
| drive exactly what the conversation was about, especially if
| you were engaged a bit and were willing to add a few more
| posts onto other posts.
|
| Basically, get in early and get a high rank and you are
| usually going to 'win'. Now, it does not work all the time,
| but it had a very high success rate. I probably should have
| studied it a bit more. My theory is that any stack-ranking
| algorithm is susceptible to it. I also suspect it works
| decently well due to the way people will create puppet
| accounts to uprank things on different platforms. But you
| know, I'd need numbers to back that up...
| cratermoon wrote:
| Anecdotally, that same technique works on HN.
| sunaookami wrote:
| And Reddit
| jerf wrote:
| It's intrinsic to any karma system that has a global
| karma rating, that is, the message has a concrete "karma"
| value that is the same for all users.
|
| drcongo recently referenced something I sort of wish I
| had time to build, and/or could just go somewhere to use:
| https://news.ycombinator.com/item?id=43843116 It's a system
| where an upvote doesn't mean "_everybody_ needs to see this
| more" but instead means "_I_ want to see more of this
| user's comments", and downvotes mean the corresponding
| opposite. It's more computationally difficult but would
| create an interestingly different community, especially
| as further elaborations were built on that. One of the
| differences would be to mitigate the first-mover
| advantage in conversations. Instead of a comment winning you
| more karma if it appeals to the general public of the
| relevant site, it would expose you to more people. That would
| produce more upvotes and downvotes in general but wouldn't
| necessarily impact visibility in the same way.
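|
| A minimal sketch of how such per-viewer weighting might look
| (hypothetical names and scoring; a real system would also want
| priors, decay, and a fallback for unknown authors):
|
|     from collections import defaultdict
|
|     # viewer -> author -> affinity score (starts neutral at 0)
|     affinity = defaultdict(lambda: defaultdict(int))
|
|     def vote(viewer, author, up=True):
|         # An upvote means "show *me* more of this author",
|         # not "show everyone more of this comment".
|         affinity[viewer][author] += 1 if up else -1
|
|     def rank_for(viewer, comments):
|         # comments: list of (comment_id, author); ranked per viewer,
|         # not by a single global karma value.
|         return sorted(comments,
|                       key=lambda c: affinity[viewer][c[1]],
|                       reverse=True)
|
|     thread = [("c1", "alice"), ("c2", "bob"), ("c3", "carol")]
|     vote("dave", "bob", up=True)
|     vote("erin", "bob", up=False)
|     print(rank_for("dave", thread))  # bob's comment ranks first for dave
|     print(rank_for("erin", thread))  # bob's comment ranks last for erin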
| all2 wrote:
| I'm building a simple community site (an HN clone) and I
| haven't gotten to the ranking algorithms yet. I'm very
| curious about how this could work.
| cainxinth wrote:
| So attention _is_ all you need?
| ukuina wrote:
| Bravo!
| jmmcd wrote:
| Absolutely devastating for the credibility of FAIR.
| sheepdestroyer wrote:
| I thought the latest Llama models were not from FAIR but from
| the GenAI team.
| ekidd wrote:
| Also, I've been hearing a lot of complaints that Chatbot Arena
| tends to favor:
|
| - Lots of bullet points in every response.
|
| - Emoji.
|
| ...even at the expense of accurate answers. And I'm beginning to
| wonder if the sycophantic behavior of recent models ("That's a
| brilliant and profound idea") is also being driven by Arena
| scores.
|
| Perhaps LLM users actually do want lots of bullets, emoji and
| fawning praise. But this seems like a perverse dynamic, similar
| to the way that social media users often engage more with content
| that outrages them.
| kozikow wrote:
| More than that - at this point, it feels to me that arenas are
| getting too focused on fitting user preferences rather than
| actual model quality.
|
| In reality I prefer different models for different things, and
| quite often it's because model X is tuned to return more of what
| I prefer - e.g. Gemini usually tends to be the best at non-
| English, ChatGPT works better for me personally for health
| questions, ...
| jimmaswell wrote:
| > sycophantic behavior of recent models
|
| The funniest example I've seen recently was "Dude. You just
| said something deep as hell without even flinching. You're
| 1000% right:"
| pc86 wrote:
| This type of response is the quickest way for me to start
| verbally abusing the LLM.
| n8m8 wrote:
| Interesting idea; I think I'm on board with this correlation
| hypothesis. Obviously it's complicated, but it does seem like
| over-reliance on arbitrary opinions from average people would
| result in valuing "feeling" over correctness.
| lostmsu wrote:
| Chiming in as usual: https://trashtalk.borg.games
|
| A social deduction game for both LLMs and humans. All the past
| games are available for anyone.
|
| I'm open for feedback.
| n8m8 wrote:
| Predictable, yet incredibly important.
| jmount wrote:
| Not the same effect, but a good related writeup:
| https://www.stefanmesken.info/machine%20learning/how-to-beat...
| bob1029 wrote:
| > Any observed statistical regularity will tend to collapse once
| pressure is placed upon it for control purposes.
|
| In the context of genetic programming and other non-traditional
| ML techniques, I've been having difficulty locating a simple
| fitness function that reliably proxies natural-language string
| similarity, due to this effect.
|
| For example, say you use something like common prefix length to
| measure how close a candidate's output string is to an objective
| string given an input string. The underlying learner will
| inevitably start doing things like repeating the input verbatim,
| especially if the input/output training tuples often share a lot
| of prefixes. So, you might try doing something like reversing the
| input to force learning to take a less crappy path [0]. The
| learner may respond degenerately by inventing a string reversing
| technique and repeating its prior behavior. So, you iterate again
| and try something like base64 encoding the input. This might
| take, but eventually you wind up with so many weird hacks that
| the learner can't make progress and the meaning of the quantities
| evaporates.
|
| Every metric I've ever looked at gets cheated in some way. The
| holy grail is probably normalized information distance
| (approximated by normalized compression distance), but then you
| have a whole new problem of finding an ideal universal compressor
| which definitely doesn't exist.
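|
| For reference, a rough sketch of the normalized compression
| distance approximation, with zlib standing in for the ideal
| compressor (which is exactly the part that doesn't exist):
|
|     import zlib
|
|     def c(data: bytes) -> int:
|         # Length of the compressed representation.
|         return len(zlib.compress(data, 9))
|
|     def ncd(x: bytes, y: bytes) -> float:
|         # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
|         # ~0 for very similar strings, ~1 for unrelated ones
|         # (zlib's header overhead keeps it from reaching exactly 0).
|         cx, cy, cxy = c(x), c(y), c(x + y)
|         return (cxy - min(cx, cy)) / max(cx, cy)
|
|     print(ncd(b"the quick brown fox", b"the quick brown fox"))    # small
|     print(ncd(b"the quick brown fox", b"jumped over a lazy dog"))  # larger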
|
| [0]: https://arxiv.org/abs/1409.3215 (Figure 1)
| internet_rand0 wrote:
| > finding an ideal universal compressor which definitely
| doesn't exist.
|
| if only we could explain this in "politician" language... too
| many with too much power think the second coming will deliver
| the "ideal universal" which doesn't exist
| godelski wrote:
| > I've been having difficulty attempting to locate a simple
| fitness function that reliably proxies natural language string
| similarity
|
| Welcome to the curse of dimensionality. The underlying
| principle there is that as dimensionality increases the ability
| to distinguish the nearest point from the furthest diminishes.
| It really becomes difficult even in dimensions we'd consider
| low by ML standards (e.g. 10-D).
|
| But I think you also need to recognize that you used the right
| wording to suggest the difficulty: "reliably _proxies_
| natural language". "Proxy" is the correct word here. It is
| actually true for _any_ measure. There is no measure that is
| perfectly aligned with the abstraction we are trying to
| measure, even something as mundane as distance. This
| naturally leads to Goodhart's Law and is why you must
| recognize that measures are guides, not answers and not
| "proof".
|
| And the example you discuss is commonly called "Reward Hacking"
| or "overfitting". It's the same concept (along with Goodhart's
| Law) but just used in different domains. Your cost/loss
| function still represents a "reward". This is part of why it is
| so important to develop a good test set, but even that is ill-
| defined. Your test set shouldn't just be disjoint from your
| training set; there should also be a certain distance between
| the two. Even if the curse of dimensionality didn't throw a
| wrench into this situation, there is no definition of what that
| distance should be. Too small and it might as well be training
| data. Ideally you want to maximize it, but that limits the
| data that can exist in training. The balance is difficult to
| strike.
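|
| A quick numerical illustration of that loss of contrast (a toy
| experiment with points drawn uniformly in the unit cube; the
| exact numbers are only illustrative):
|
|     import math, random
|
|     random.seed(0)
|
|     def contrast(dim, n_points=1000):
|         # Ratio of farthest to nearest distance from the origin to
|         # random points; it collapses toward 1 as dimension grows.
|         pts = [[random.random() for _ in range(dim)]
|                for _ in range(n_points)]
|         dists = [math.sqrt(sum(x * x for x in p)) for p in pts]
|         return max(dists) / min(dists)
|
|     for dim in (2, 10, 100, 1000):
|         print(dim, round(contrast(dim), 2))
|     # The ratio is large in 2-D and shrinks toward ~1 by 1000-D.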
| godelski wrote:
| Many of these things are ones that people have been screaming
| about for years (including Sarah Hooker). It's great to see some
| numbers attached. And in classic Cohere manner, they are not
| pulling punches on some specific people. Expect those people to
| push back.
|
| There's a crux that makes it easy to understand why we should
| expect it. If you code (I assume you do) you probably (hopefully)
| know that you can't test your way into proving your code is
| correct. Test Driven Development (TDD) is a flawed paradigm. You
| should use tests, but they are hints. That's why Cohere is
| quoting Goodhart at the top of the intro[0]. There is _NO_ metric
| where the metric is perfectly aligned with the reason you
| implemented that metric in the first place (intent). This is
| fucking alignment 101 here, which is why it is really ironic how
| prolific this attitude is in ML [1]. I'm not sure I believe any
| person or company that claims they can make safe AI if they are
| trying to shove benchmarks at you.
|
| Pay close attention: evaluation is very hard, and it is getting
| harder. Remember reward hacking; it is still alive and well (it
| is Goodhart's Law). You have to think about what criteria meet
| your objective. This is true for any job! But think about RLHF
| and similar strategies. What methods also maximize the reward
| function? If it is human preference, deception maximizes it just
| as well as (or better than) accuracy. This is a bad design
| pattern. You want to make errors as loud as possible, but this
| paradigm makes errors as quiet as possible, and you cannot
| confuse that with a lack of errors. It makes evaluation
| incredibly difficult.
|
| Metrics are guides, not targets.
|
| [0] Users that recognize me may remember me for mentioning
| 'Goodhart's Hell', the adoption of Goodhart's Law as a feature
| instead of a bug. It is prolific, and problematic.
|
| [1] We used to say that when people say "AI" instead of "ML", you
| should put your guard up. But a very useful rule that's been true
| for years is "if people try to prove it by benchmarks alone,
| they're selling snake oil." There should always be analysis _in
| addition to_ metrics.
| mrandish wrote:
| I'm not even following AI model performance testing that closely
| but I'm hearing increasing reports they're inaccurate due to
| accidental or intentional test data leaking into training data
| and other ways of training to the test.
|
| Also, ARC AGI reported they've been unable to independently
| replicate OpenAI's claimed breakthrough score from December.
| There's just too much money at stake now to _not_ treat all AI
| model performance testing as an adversarial, no-holds-barred
| brawl. The default assumption should be all entrants will cheat
| in any way possible. Commercial entrants with large teams of
| highly-incentivized people will search and optimize for every
| possible advantage - if not outright cheat. As a result, smaller
| academic, student or community teams working part-time will tend
| to score lower than they would on a level playing field.
| godelski wrote:
| > inaccurate due to accidental or intentional test data leaking
| into training data and other ways of training to the test.
|
| Even if you assume no intentional data leakage, it is fairly
| easy to do accidentally. Defining good test data is hard.
| Your test data should be disjoint from training, and even
| exact deduplication is hard. Your test data should also belong
| to the same target distribution BUT be sufficiently distant
| from your training data in order to measure generalization.
| This is ill-defined in the best of cases, and ideally you want
| to maximize the distance between training data and test data.
| But in high-dimensional settings distance is essentially
| meaningless (you cannot distinguish the nearest from the
| furthest).
|
| Plus there are standard procedures that are explicit data
| leakage. Commonly, people will update hyperparameters to
| increase test results. While the model doesn't have access to
| the data, you are passing along information: you are the data
| (information) leakage. Meta-information is still useful to
| machine learning models and they will exploit it. That's why
| there are things like optimal hyperparameters and initialization
| schemes that lead to better solutions (or mode collapse), and
| it even plays a part in the lottery ticket hypothesis.
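|
| A toy illustration of that kind of meta-leakage (synthetic data
| with no real signal, so any score above chance is leakage; the
| "hyperparameter" here is just which feature to predict from):
|
|     import random
|
|     random.seed(0)
|     N_FEATURES, N_SAMPLES = 200, 100
|
|     def dataset(n):
|         # Pure noise: features and labels are independent coin flips.
|         X = [[random.randint(0, 1) for _ in range(N_FEATURES)]
|              for _ in range(n)]
|         y = [random.randint(0, 1) for _ in range(n)]
|         return X, y
|
|     def accuracy(feature, X, y):
|         # "Model": predict the label directly from one feature.
|         return sum(row[feature] == label
|                    for row, label in zip(X, y)) / len(y)
|
|     (X_val, y_val), (X_test, y_test) = dataset(N_SAMPLES), dataset(N_SAMPLES)
|
|     # Leaky protocol: pick the feature that maximizes TEST accuracy.
|     leaky = max(range(N_FEATURES),
|                 key=lambda f: accuracy(f, X_test, y_test))
|     # Clean protocol: pick on validation, report on untouched test set.
|     clean = max(range(N_FEATURES),
|                 key=lambda f: accuracy(f, X_val, y_val))
|
|     print("leaky test accuracy:",
|           accuracy(leaky, X_test, y_test))  # well above 0.5
|     print("clean test accuracy:",
|           accuracy(clean, X_test, y_test))  # ~0.5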
|
| Measuring is pretty messy stuff, even in the best of
| situations. Intentional data leakage removes all sense of good
| faith. Unintentional data leakage stresses the importance of
| learning domain depth, and is one of the key reasons learning
| math is so helpful to machine learning. Even the intuition can
| provide critical insights. Ignoring this fact of life is
| myopic.
|
| > smaller academic, student or community teams working part-time
| will tend to score lower than they would on a level playing
| field.
|
| It is rare for academics and students to work "part-time". I'm
| about to defend my PhD (in ML) and I rarely take vacations and
| rarely work less than 50hrs/wk. This is also pretty common
| among my peers.
|
| But a big problem is that the "GPU Poor" notion is ill-founded.
| It ignores a critical aspect of the research and development
| cycle: _basic_ research. You might see this in something like
| the NASA TRL scale [0]. Classically, academics work predominantly
| at the low-level TRLs, but there's been this weird push in ML (and
| it's not too uncommon in CS in general) to place the focus on
| products rather than on expanding knowledge/foundations. While TRLs
| 1-4 have extremely high failure rates (even between steps),
| they lay the foundation that allows us to build higher TRL
| things (i.e. products). This notion that you can't do small
| scale (data or compute) experiments and contribute to the field
| is damaging. It sets us back. It breeds stagnation as it
| necessitates narrowing of research directions. You can't be as
| risky! The consequence of this can only lead to a Wiley Coyote
| type moment, where we're running and suddenly find there is no
| ground beneath us. We had a good thing going. Gov money funds
| low level research which has higher risks and longer rates of
| return, but the research becomes public and thus provides
| foundations for others to build on top of.
|
| [0] https://www.nasa.gov/directorates/somd/space-
| communications-...
| mrandish wrote:
| > It is rare for academics and students to work "part-time".
|
| Sorry, that phrasing didn't properly convey my intent, which
| was more that most academics, students and
| community/hobbyists have other simultaneous responsibilities
| which they must balance.
| godelski wrote:
| Thanks for the clarification. I think this makes more
| sense, but I think I need to push back a tad. It is a bit
| messier (for academia; I don't disagree for
| community/hobbyists).
|
| In the US PhD system, students usually take classes
| during the first two years, and this is often where they
| serve as teaching assistants too. But after quals (or
| whatever), once you advance to PhD Candidate, you no longer
| take classes, and frequently your funding comes through grants
| or other sources (though it may include teaching/assisting;
| funding is always in flux...). For most of my time, as is
| common for most PhDs in my _department_, I've been on research.
| While still classified as 0.49 employee and 0.51 student, the
| work is identical despite the categorization.
|
| My point is that I would not generalize this notion.
| There's certainly very high variance, but I think it is
| less right than wrong. Sure, I do have other
| responsibilities like publishing, mentoring, and random
| bureaucratic administrative stuff, but this isn't
| exceptionally different from when I've interned or the 4
| years I spent working prior to going to grad school.
|
| Though something that is wild about this system (and it
| generalizes outside academia) is that it completely
| flips when you graduate from PhD {Student,Candidate} to
| Professor. As a professor you have so many auxiliary
| responsibilities that most do not have time for research.
| You have to teach, write grants, do a lot of
| department service (admins seem to increase this workload,
| not decrease it...), and other stuff. It seems odd to train
| someone for many years and then put them in... an
| essentially administrative or managerial role. I say this
| generalizes, because we do the same thing outside academia.
| You can usually only get promoted as an engineer (pick your
| term) for so long before you need to transition to
| management. Definitely I want technical managers, but that
| shouldn't prevent a path for advancement through technical
| capabilities. You spent all that time training and honing
| those skills, why abandon them? Why assume they transfer to
| the skills of management? (some do, but enough?). This is
| quite baffling to me and I don't know why we do this. In
| "academia" you can kinda avoid this by going to post-doc or
| a government labs, or even the private sector. But post-doc
| and private sector just delay this transition and
| government labs are hit or miss (but this is why people
| like working there and will often sacrifice salaries).
|
| (The idea in academia is you then have full freedom once
| you're tenured. But it isn't like the pressures of "publish
| or perish" disappear, and it is hard to break habits. Plus,
| you'd be a real dick if you are sacrificing your PhD
| students' careers in pursuit of your own work. So the
| idealized belief is quite inaccurate. If anything, we want
| young researchers to be attempting riskier research)
|
| TLDR: for graduate students, I disagree; but for
| professors/hobbyists/undergrads/etc., I do agree.
___________________________________________________________________
(page generated 2025-04-30 23:01 UTC)