[HN Gopher] The Leaderboard Illusion
       ___________________________________________________________________
        
       The Leaderboard Illusion
        
       Author : pongogogo
       Score  : 139 points
       Date   : 2025-04-30 07:58 UTC (15 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | pongogogo wrote:
        | I think this is a really interesting paper from Cohere. It really
        | feels that at this point you can't trust any public benchmark;
        | you really need your own private evals.
        
         | ilrwbwrkhv wrote:
          | Yup, in my private evals I have repeatedly found that DeepSeek
          | has the best models for everything, and yet in a lot of these
          | public ones it always seems like someone else is on top. I
          | don't know why.
        
           | __alexs wrote:
           | Publishing them might help you find out.
        
             | refulgentis wrote:
             | ^ This.
             | 
              | If I had to hazard a guess, as a poor soul doomed to
              | maintain several closed and open source models acting
              | agentically: I think you are hyper-focused on chat trivia
              | use cases (DeepSeek has a very, very hard time with tool
              | calling, and they say as much themselves in their API docs).
        
         | AstroBen wrote:
         | Any tips on coming up with good private evals?
        
           | pongogogo wrote:
            | Yes, I wrote something up here on how Andrej Karpathy
            | evaluated Grok 3 ->
           | https://tomhipwell.co/blog/karpathy_s_vibes_check/
           | 
            | I would pick one or two parts of that analysis that are most
           | relevant to you and zoom in. I'd choose something difficult
           | that the model fails at, then look carefully at how the model
           | failures change as you test different model generations.
        
       | aredox wrote:
        | The fact that those big LLM developers devote a significant
        | amount of effort to gaming benchmarks is a big show of confidence
        | that they are making progress towards AGI and will recoup those
        | billions of dollars and man-hours /s
        
         | amelius wrote:
         | Are the benchmark prompts public and isn't that where the
         | problem lies?
        
           | StevenWaterman wrote:
            | No, even if the benchmarks are private it's still an issue,
            | because you can overfit to the benchmark by trying X random
            | variations of the model and picking the one that performs
            | best on the benchmark.
           | 
           | It's similar to how I can pass any multiple-choice exam if
           | you let me keep attempting it and tell me my overall score at
           | the end of each attempt - even if you don't tell me which
           | answers were right/wrong
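            | 
            | A toy simulation of that selection effect (all numbers
            | invented), where every candidate has identical true skill and
            | only the best measured score reaches the leaderboard:
            | 
            |   import random
            |   random.seed(0)
            |   TRUE_SKILL = 0.70   # every candidate answers 70% correctly
            |   QUESTIONS = 200     # size of the hidden benchmark
            |   CANDIDATES = 50     # model variants tried against it
            | 
            |   def measured_score():
            |       # one benchmark run: binomial noise around true skill
            |       hits = sum(random.random() < TRUE_SKILL
            |                  for _ in range(QUESTIONS))
            |       return hits / QUESTIONS
            | 
            |   scores = [measured_score() for _ in range(CANDIDATES)]
            |   print("true skill:", TRUE_SKILL)
            |   print("average measured:", sum(scores) / len(scores))
            |   print("best-of-N reported:", max(scores))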
        
             | amelius wrote:
             | Maybe there should be some rate limiting on it then? I.e.,
             | once a month you can benchmark your model. Of course you
             | can submit under different names, but how many company
             | names can someone realistically come up with and register?
        
               | sebastiennight wrote:
               | So now you want OpenAI to go _even wilder_ in how they
               | name each new model?
        
               | amelius wrote:
               | 1 model per company per month, max.
        
             | VladVladikoff wrote:
              | Now I'm wondering what the most efficient algorithm is to
              | obtain a mark of 100% in the least number of attempts.
              | Guessing one question per attempt seems inefficient.
              | Perhaps submitting the whole exam as option A, then the
              | whole exam as option B, and so on, at the start, could
              | give you a count of how many As are correct. Then maybe
              | some sort of binary search through the rest of the
              | options? You could submit the first 1/2 as A and the
              | second 1/2 as B. Etc. hmmmm
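              | 
              | A rough sketch of the naive baseline (change one question
              | at a time from an all-A submission and watch the total
              | score move), assuming a hypothetical grade() oracle that
              | only returns the overall score. Smarter group-testing
              | schemes like the all-A / binary-split ideas above can cut
              | the attempt count further:
              | 
              |   OPTIONS = "ABCD"
              | 
              |   def recover_key(num_questions, grade):
              |       baseline = ["A"] * num_questions
              |       base_score = grade(baseline)
              |       key = []
              |       for i in range(num_questions):
              |           answer = "A"
              |           for option in OPTIONS[1:]:
              |               trial = list(baseline)
              |               trial[i] = option
              |               if grade(trial) == base_score + 1:
              |                   answer = option
              |                   break
              |           key.append(answer)
              |       return key
              | 
              |   # toy stand-in for the exam's grading oracle
              |   secret = list("BADCABDACB")
              |   grade = lambda a: sum(x == s for x, s in zip(a, secret))
              |   print("".join(recover_key(len(secret), grade)))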
        
               | amelius wrote:
                | Maybe an LLM can tell you how to best approach this
               | problem ;)
        
         | leto_ii wrote:
         | Is this sarcasm? Otherwise I'm not sure how that follows. Seems
         | more reasonable to believe that they're hitting walls and
         | switching to PR and productizing.
        
           | RodgerTheGreat wrote:
           | Ending a paragraph with "/s" is a moderately common
           | convention for conveying a sarcastic tone through text.
        
           | Terr_ wrote:
            | I believe they _are_ being sarcastic, but Poe's Law is in
           | play and it's too ambiguous for practical purposes.
        
       | unkulunkulu wrote:
       | Sounds like classic inequality observed everywhere. Success leads
       | to attention leads to more success.
       | 
        | Why spend evaluation resources on outsiders? Everyone wants to
        | know exactly who is first, second, etc.; after #10 it's "do your
        | own evaluation" if this is important to you.
       | 
       | Thus, we have this inequality.
        
         | boxed wrote:
         | Is it? Sounds to me like they run the same experiment many
          | times and keep the "best" results. Which is cheating - or, if
          | the same thing were done in biomedical research, research fraud.
        
           | sumtechguy wrote:
            | Back in the Slashdot days I would experiment with steering
            | conversations. This was due to the way SD would rank and show
           | its posts. Anything below a 3 would not change anything. But
           | if you could get in early AND get a +5 on your post you could
           | drive exactly what the conversation was about. Especially if
           | you were engaged a bit and were willing to add a few more
           | posts onto other posts.
           | 
           | Basically get in early and get a high rank and you are
           | usually going to 'win'. Now it does not work all the time.
           | But it had a very high success rate. I probably should have
           | studied it a bit more. My theory is any stack ranking
           | algorithm is susceptible to it. I also suspect it works
           | decently well due to the way people will create puppet
           | accounts to up rank things on different platforms. But you
           | know, need numbers to back that up...
        
             | cratermoon wrote:
             | Anecdotally, that same technique works on HN.
        
               | sunaookami wrote:
               | And Reddit
        
               | jerf wrote:
               | It's intrinsic to any karma system that has a global
               | karma rating, that is, the message has a concrete "karma"
               | value that is the same for all users.
               | 
                | drcongo recently referenced something I sort of wish I
                | had time to build (and/or could just go somewhere to
                | use): https://news.ycombinator.com/item?id=43843116 It's
                | a system where an upvote doesn't mean "_everybody_ needs
                | to see this more" but instead means "_I_ want to see more
                | of this user's comments", and downvotes mean the
                | corresponding opposite. It's more computationally
                | difficult but would
               | create an interestingly different community, especially
               | as further elaborations were built on that. One of the
               | differences would be to mitigate the first-mover
                | advantage in conversations. Instead of a comment winning
                | you more karma if it appeals to the general public of the
                | relevant site, it would expose you to more people. That
                | would produce more upvotes and
               | downvotes in general but wouldn't necessarily impact
               | visibility in the same way.
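                | 
                | A minimal sketch of that idea (all names invented), where
                | a vote adjusts a per-reader affinity for the author
                | rather than a global score:
                | 
                |   from collections import defaultdict
                | 
                |   affinity = defaultdict(float)  # (reader, author)
                | 
                |   def vote(reader, author, up=True):
                |       affinity[(reader, author)] += 1.0 if up else -1.0
                | 
                |   def rank_for(reader, comments):
                |       # comments: list of (author, text) pairs
                |       return sorted(
                |           comments,
                |           key=lambda c: affinity[(reader, c[0])],
                |           reverse=True)
                | 
                |   vote("alice", "bob")              # see more of bob
                |   vote("alice", "carol", up=False)  # see less of carol
                |   print(rank_for("alice",
                |                  [("carol", "hi"), ("bob", "hello")]))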
        
               | all2 wrote:
                | I'm building a simple community site (an HN clone) and I
               | haven't gotten to the ranking algorithms yet. I'm very
               | curious about how this could work.
        
         | cainxinth wrote:
         | So attention _is_ all you need?
        
           | ukuina wrote:
           | Bravo!
        
       | jmmcd wrote:
       | Absolutely devastating for the credibility of FAIR.
        
         | sheepdestroyer wrote:
          | I thought the latest Llama models were not from FAIR but from
          | the GenAI team.
        
       | ekidd wrote:
       | Also, I've been hearing a lot of complaints that Chatbot Arena
       | tends to favor:
       | 
       | - Lots of bullet points in every response.
       | 
       | - Emoji.
       | 
       | ...even at the expense of accurate answers. And I'm beginning to
       | wonder if the sycophantic behavior of recent models ("That's a
       | brilliant and profound idea") is also being driven by Arena
       | scores.
       | 
       | Perhaps LLM users actually do want lots of bullets, emoji and
       | fawning praise. But this seems like a perverse dynamic, similar
       | to the way that social media users often engage more with content
       | that outrages them.
        
         | kozikow wrote:
          | More than that - at this point it feels to me that arenas are
          | getting too focused on fitting user preferences rather than
          | actual model quality.
          | 
          | In reality I prefer different models for different things, and
          | quite often it's because model X is tuned to return more of my
          | preference - e.g. Gemini tends to be the best in non-English,
          | ChatGPT works better for me personally for health
          | questions, ...
        
         | jimmaswell wrote:
         | > sycophantic behavior of recent models
         | 
         | The funniest example I've seen recently was "Dude. You just
         | said something deep as hell without even flinching. You're
         | 1000% right:"
        
           | pc86 wrote:
           | This type of response is the quickest way for me to start
           | verbally abusing the LLM.
        
         | n8m8 wrote:
         | Interesting idea, I think I'm on board with this correlation
          | hypothesis. Obviously it's complicated, but it does seem like
         | over-reliance on arbitrary opinions from average people would
         | result in valuing "feeling" over correctness.
        
       | lostmsu wrote:
       | Chiming in as usual: https://trashtalk.borg.games
       | 
       | A social deduction game for both LLMs and humans. All the past
       | games are available for anyone.
       | 
        | I'm open to feedback.
        
       | n8m8 wrote:
       | Predictable, yet incredibly important.
        
       | jmount wrote:
        | Not the same effect, but a good related writeup:
       | https://www.stefanmesken.info/machine%20learning/how-to-beat...
        
       | bob1029 wrote:
       | > Any observed statistical regularity will tend to collapse once
       | pressure is placed upon it for control purposes.
       | 
        | In the context of genetic programming and other non-traditional
        | ML techniques, I've been having difficulty locating a
       | simple fitness function that reliably proxies natural language
       | string similarity due to this effect.
       | 
       | For example, say you use something like common prefix length to
       | measure how close a candidate's output string is to an objective
       | string given an input string. The underlying learner will
       | inevitably start doing things like repeating the input verbatim,
       | especially if the input/output training tuples often share a lot
       | of prefixes. So, you might try doing something like reversing the
       | input to force learning to take a less crappy path [0]. The
       | learner may respond degenerately by inventing a string reversing
       | technique and repeating its prior behavior. So, you iterate again
       | and try something like base64 encoding the input. This might
       | take, but eventually you wind up with so many weird hacks that
       | the learner can't make progress and the meaning of the quantities
       | evaporates.
       | 
       | Every metric I've ever looked at gets cheated in some way. The
       | holy grail is probably normalized information distance
       | (approximated by normalized compression distance), but then you
       | have a whole new problem of finding an ideal universal compressor
       | which definitely doesn't exist.
       | 
       | [0]: https://arxiv.org/abs/1409.3215 (Figure 1)
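        | 
        | A rough sketch of the two proxies in question, with zlib as a
        | (decidedly non-ideal) stand-in compressor for NCD:
        | 
        |   import zlib
        | 
        |   def prefix_score(candidate, target):
        |       # fraction of the target matched as a common prefix
        |       n = 0
        |       for a, b in zip(candidate, target):
        |           if a != b:
        |               break
        |           n += 1
        |       return n / max(len(target), 1)
        | 
        |   def ncd(x, y):
        |       # normalized compression distance, lower = more similar
        |       cx = len(zlib.compress(x.encode()))
        |       cy = len(zlib.compress(y.encode()))
        |       cxy = len(zlib.compress((x + y).encode()))
        |       return (cxy - min(cx, cy)) / max(cx, cy)
        | 
        |   inp = "the cat sat on the"
        |   target = inp + " mat"
        |   echo = inp             # degenerate learner: repeat the input
        |   print(prefix_score(echo, target))  # high, rewards the echo
        |   print(ncd(echo, target))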
        
         | internet_rand0 wrote:
         | > finding an ideal universal compressor which definitely
         | doesn't exist.
         | 
         | if only we could explain this in "politician" language... too
         | many with too much power think the second coming will deliver
         | the "ideal universal" which doesn't exist
        
         | godelski wrote:
         | > I've been having difficulty attempting to locate a simple
         | fitness function that reliably proxies natural language string
         | similarity
         | 
         | Welcome to the curse of dimensionality. The underlying
         | principle there is that as dimensionality increases the ability
         | to distinguish the nearest point from the furthest diminishes.
         | It really becomes difficult even in dimensions we'd consider
         | low by ML standards (e.g. 10-D).
         | 
          | But I think you also need to recognize that you used wording
          | that suggests the difficulty: "reliably _proxies_ natural
          | language". "Proxy" is the correct word here. It is actually
          | true for _any_ measure. There is no measure that is perfectly
          | aligned with the abstraction we are trying to measure, even
          | with something as mundane as distance. This naturally leads to
          | Goodhart's Law and is why you must recognize that measures are
          | guides, not answers and not "proof".
         | 
         | And the example you discuss is commonly called "Reward Hacking"
         | or "overfitting". It's the same concept (along with Goodhart's
         | Law) but just used in different domains. Your cost/loss
         | function still represents a "reward". This is part of why it is
         | so important to develop a good test set, but even that is ill-
         | defined. Your test set shouldn't just be disjoint from your
          | training data; there should also be a certain distance between
          | the two. Even if the curse of dimensionality didn't throw a
          | wrench into this
         | situation, there is no definition for what that distance should
         | be. Too small and it might as well be training data.
         | Preferentially you want to maximize it, but that limits the
         | data that can exist in training. The balance is difficult to
         | strike.
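          | 
          | A quick numerical sketch of that concentration effect (uniform
          | random points; relative contrast between the farthest and
          | nearest neighbor of a random query):
          | 
          |   import math, random
          |   random.seed(0)
          | 
          |   def contrast(dim, n=1000):
          |       pts = [[random.random() for _ in range(dim)]
          |              for _ in range(n)]
          |       q = [random.random() for _ in range(dim)]
          |       d = [math.dist(q, p) for p in pts]
          |       return (max(d) - min(d)) / min(d)
          | 
          |   for dim in (2, 10, 100, 1000):
          |       print(dim, round(contrast(dim), 2))
          |   # the contrast collapses as dim grows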
        
       | godelski wrote:
       | Many of these things are ones that people have been screaming
       | about for years (including Sarah Hooker). It's great to see some
       | numbers attached. And in classic Cohere manner, they are not
        | pulling punches on some specific people. Expect them to push
       | back.
       | 
       | There's a crux that makes it easy to understand why we should
       | expect it. If you code (I assume you do) you probably (hopefully)
       | know that you can't test your way into proving your code is
       | correct. Test Driven Development (TDD) is a flawed paradigm. You
       | should use tests, but they are hints. That's why Cohere is
        | quoting Goodhart at the top of the intro[0]. There is _NO_ metric
        | that is perfectly aligned with the reason you implemented it in
        | the first place (intent). This is fucking alignment 101 here,
        | which is why it is really ironic how prolific this attitude is in
        | ML[1]. I'm not sure I believe any
       | person or company that claims they can make safe AI if they are
       | trying to shove benchmarks at you.
       | 
        | Pay close attention: evaluation is very hard, and it is getting
        | harder. Remember reward hacking; it is still alive and well (it
        | is Goodhart's Law). You have to think about what criteria meet
        | your objective. This is true for any job! But think about RLHF
        | and similar strategies. What methods also maximize the reward
        | function? If it is human preference, deception maximizes just as
        | well as (or better than) accuracy. This is a bad design pattern.
        | You
       | want to make errors as loud as possible, but this paradigm makes
       | errors as quiet as possible and you cannot confuse that with lack
       | of errors. It makes evaluation incredibly difficult.
       | 
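        | A deliberately cartoonish toy of that failure mode (hypothetical
        | rater, made-up answers): if the reward is "what raters prefer"
        | and raters reward confident-sounding text, the policy that
        | maximizes the reward has no reason to stay accurate:
        | 
        |   def rater_preference(answer):
        |       # toy rater: loves confidence, never checks facts
        |       return 1.0 if "definitely" in answer else 0.3
        | 
        |   policies = {
        |       "accurate_hedged": "It might be X, but I'm not sure.",
        |       "confident_wrong": "It is definitely Y.",
        |   }
        |   best = max(policies,
        |              key=lambda p: rater_preference(policies[p]))
        |   print(best)  # confident_wrong wins the preference reward
        | 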
       | Metrics are guides, not targets
       | 
       | [0] Users that recognize me may remember me for mentioning
       | 'Goodhart's Hell', the adoption of Goodhart's Law as a feature
       | instead of a bug. It is prolific, and problematic.
       | 
        | [1] We used to say that when people say "AI" instead of "ML" you
        | should put your guard up. Another useful heuristic that's been
        | true for years is "if people try to prove things by benchmarks
        | alone, they're selling snake oil." There should always be
        | analysis _in addition
       | to_ metrics.
        
       | mrandish wrote:
        | I'm not even following AI model performance testing that closely,
        | but I'm hearing increasing reports that results are inaccurate
        | due to accidental or intentional test data leaking into training
        | data and other ways of training to the test.
       | 
       | Also, ARC AGI reported they've been unable to independently
       | replicate OpenAI's claimed breakthrough score from December.
       | There's just too much money at stake now to _not_ treat all AI
       | model performance testing as an adversarial, no-holds-barred
       | brawl. The default assumption should be all entrants will cheat
       | in any way possible. Commercial entrants with large teams of
       | highly-incentivized people will search and optimize for every
       | possible advantage - if not outright cheat. As a result, smaller
       | academic, student or community teams working part-time will tend
       | to score lower than they would on a level playing field.
        
         | godelski wrote:
         | > inaccurate due to accidental or intentional test data leaking
         | into training data and other ways of training to the test.
         | 
         | Even if you assume no intentional data leakage it is fairly
         | easy to accidentally do it. Defining good test data is hard.
          | Your test data should be disjoint from training, and even
          | exact deduplication is hard. Your test data should also belong
         | to the same target distribution BUT be sufficiently distant
         | from your training data in order to measure generalization.
         | This is ill-defined in the best of cases, and ideally you want
         | to maximize the distance between training data and test data.
         | But high dimensional settings mean distance is essentially
         | meaningless (you cannot distinguish the nearest from the
         | furthest).
         | 
          | Plus, there are standard procedures that are explicit data
          | leakage. Commonly people will update hyperparameters to
          | increase test results. While the model doesn't have access to
          | the data, you are passing along information. You are the data
          | (information) leakage. Meta-information is still useful to
          | machine learning models and they will exploit it. That's why
          | there are things like optimal hyperparameters, initialization
          | schemes that lead to better solutions (or mode collapse), and
          | why it even plays a part in the lottery ticket hypothesis.
         | 
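          | A sketch of that protocol point (fit_and_score is a random
          | placeholder invented for illustration): judge hyperparameter
          | candidates on a validation split and touch the test set exactly
          | once; even with pure noise, the leaky max-over-candidates-on-
          | test number comes out inflated:
          | 
          |   import random
          |   random.seed(0)
          | 
          |   data = list(range(1000))   # stand-in for a labeled dataset
          |   random.shuffle(data)
          |   train, val, test = data[:700], data[700:850], data[850:]
          | 
          |   def fit_and_score(hp, train_split, eval_split):
          |       # hypothetical: train with hyperparams hp, return
          |       # accuracy on eval_split (random placeholder here)
          |       return random.random()
          | 
          |   # leaky: every candidate is judged on the test set
          |   leaky = max(fit_and_score(h, train, test) for h in range(50))
          | 
          |   # clean: select on val, then one final test measurement
          |   best_h = max(range(50),
          |                key=lambda h: fit_and_score(h, train, val))
          |   final = fit_and_score(best_h, train, test)
          |   print(round(leaky, 3), round(final, 3))
          | 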
         | Measuring is pretty messy stuff, even in the best of
         | situations. Intentional data leakage removes all sense of good
         | faith. Unintentional data leakage stresses the importance of
         | learning domain depth, and is one of the key reasons learning
         | math is so helpful to machine learning. Even the intuition can
          | provide critical insights. Ignoring this fact of life is
          | myopic.
          | 
          | > smaller academic, student or community teams working
          | part-time will tend to score lower than they would on a level
          | playing field.
         | 
         | It is rare for academics and students to work "part-time". I'm
         | about to defend my PhD (in ML) and I rarely take vacations and
         | rarely work less than 50hrs/wk. This is also pretty common
         | among my peers.
         | 
         | But a big problem is that the "GPU Poor" notion is ill-founded.
         | It ignores a critical aspect of the research and development
         | cycle: _basic_ research. You might see this in something like
         | NASA TRL[0]. Classically academics work predominantly in the
          | low-level TRLs, but there's been this weird push in ML (and
         | not too uncommon in CS in general) for placing a focus on
         | products rather than expanding knowledge/foundations. While TRL
         | 1-4 have extremely high failure rates (even between steps),
         | they lay the foundation that allows us to build higher TRL
         | things (i.e. products). This notion that you can't do small
         | scale (data or compute) experiments and contribute to the field
         | is damaging. It sets us back. It breeds stagnation as it
         | necessitates narrowing of research directions. You can't be as
          | risky! The consequence of this can only be a Wile E. Coyote
          | type moment, where we're running and suddenly find there is no
          | ground beneath us. We had a good thing going: government money
          | funds low-level research, which has higher risk and slower
          | returns, but the research becomes public and thus provides
         | foundations for others to build on top of.
         | 
         | [0] https://www.nasa.gov/directorates/somd/space-
         | communications-...
        
           | mrandish wrote:
           | > It is rare for academics and students to work "part-time".
           | 
           | Sorry, that phrasing didn't properly convey my intent, which
           | was more that most academics, students and
           | community/hobbyists have other simultaneous responsibilities
           | which they must balance.
        
             | godelski wrote:
             | Thanks for the clarification. I think this makes more
             | sense, but I think I need to push back a tad. It is a bit
             | messier (for academia, I don't disagree for
             | community/hobbyists)
             | 
             | In the US PhD system usually PhD students take classes
             | during the first two years and this is often where they
              | serve as teaching assistants too. But after quals (or
              | whatever), once you advance to PhD Candidate, you no longer
              | take classes, and frequently your funding comes through
              | grants or other areas (but may include teaching/assisting;
              | funding is always in flux...). For most of my time, as is
              | common for most PhDs in my _department_, I've been on
              | research. While still classified as 0.49 employee and 0.51
              | student, the work is identical despite the categorization.
             | 
             | My point is that I would not generalize this notion.
             | There's certainly very high variance, but I think it is
             | less right than wrong. Sure, I do have other
             | responsibilities like publishing, mentoring, and random
             | bureaucratic administrative stuff, but this isn't
             | exceptionally different from when I've interned or the 4
             | years I spent working prior to going to grad school.
             | 
              | Though I think something that is wild about this system
              | (and generalizes outside academia) is that it completely
              | flips when you graduate from PhD {Student,Candidate} to
              | Professor. As a professor you have so many auxiliary
              | responsibilities that most do not have time for research.
             | You have to teach, do grant writing, there is a lot of
             | department service (admins seem to increase this workload,
             | not decrease...), and other stuff. It seems odd to train
              | someone for many years and then put them in... an
              | essentially administrative or managerial role. I say this
             | generalizes, because we do the same thing outside academia.
             | You can usually only get promoted as an engineer (pick your
             | term) for so long before you need to transition to
              | management. I definitely want technical managers, but that
             | shouldn't prevent a path for advancement through technical
             | capabilities. You spent all that time training and honing
             | those skills, why abandon them? Why assume they transfer to
             | the skills of management? (some do, but enough?). This is
             | quite baffling to me and I don't know why we do this. In
             | "academia" you can kinda avoid this by going to post-doc or
             | a government labs, or even the private sector. But post-doc
             | and private sector just delay this transition and
             | government labs are hit or miss (but this is why people
             | like working there and will often sacrifice salaries).
             | 
             | (The idea in academia is you then have full freedom once
             | you're tenured. But it isn't like the pressures of "publish
             | or perish" disappear, and it is hard to break habits. Plus,
             | you'd be a real dick if you are sacrificing your PhD
             | students' careers in pursuit of your own work. So the
             | idealized belief is quite inaccurate. If anything, we want
             | young researchers to be attempting riskier research)
             | 
              | TL;DR: for graduate students, I disagree; but for
              | professors/hobbyists/undergrads/etc., I do agree.
        
       ___________________________________________________________________
       (page generated 2025-04-30 23:01 UTC)