[HN Gopher] The MTEB benchmark is dead
       ___________________________________________________________________
        
       The MTEB benchmark is dead
        
       Author : herecomethefuzz
       Score  : 31 points
       Date   : 2024-12-24 19:44 UTC (3 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | 0xab wrote:
        | Datasets need to stop shipping with any training sets at all!
        | And their licenses should forbid anyone from using the test
        | set to update the parameters of any model.
       | 
       | We did this with ObjectNet (https://objectnet.dev/) years ago.
       | It's only a test set, no training set provided at all. Back then
       | it was very controversial and we were given a hard time for it
       | initially. Now it's more accepted. Time to make this idea
       | mainstream.
       | 
       | No more training sets. Everything should be out of domain.
        
         | upghost wrote:
          | I don't know how this is possible with LLM tests. The
          | closed-source models will get access to at least the
          | questions when you send them over the fence via the API.
         | 
         | This gives closed source models an enormous advantage over
         | open-source models.
         | 
         | The FrontierMath dataset has this same problem[1].
         | 
         | It's a shame because creating these benchmarks is time
         | consuming and expensive.
         | 
         | I don't know of a way to fix this except perhaps partially by
         | using reward models to evaluate results on random questions
         | instead of using datasets, but there would be a lot of
         | reproducibility problems with that.
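          | 
          | Very roughly, what I have in mind with the reward-model idea
          | is something like this (just a sketch; `query_model` and
          | `reward_score` stand in for whatever API and judge model you
          | would actually use):
          | 
          |   import random
          | 
          |   # Evaluate on freshly generated questions scored by a
          |   # judge, instead of a fixed test set that can leak.
          |   def make_question(rng):
          |       # A fresh question that can't have been memorized.
          |       a, b = rng.randint(100, 999), rng.randint(100, 999)
          |       return f"What is {a} * {b}?", a * b
          | 
          |   def evaluate(query_model, reward_score, n=100, seed=0):
          |       rng = random.Random(seed)  # seed aids reproducibility
          |       total = 0.0
          |       for _ in range(n):
          |           question, reference = make_question(rng)
          |           answer = query_model(question)
          |           total += reward_score(question, answer, reference)
          |       return total / n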
         | 
         | Still -- not sure how to overcome this.
         | 
         | [1]: https://news.ycombinator.com/item?id=42494217
        
           | light_hue_1 wrote:
           | It's possible.
           | 
           | I'm not worried about cheaters. We just need to lay out clear
           | rules. You cannot look at the inputs or outputs in any way.
           | You cannot log them. You cannot record them for future use.
           | Either manually or in an automated way.
           | 
           | If someone cheats, they will be found out. Their contribution
            | won't stand the test of time; no one will replicate those
           | results with their method. And their performance on datasets
           | that they cheated on will be astronomical compared to
           | everything else.
           | 
           | FrontierMath is a great example of a failure in this space.
            | By going closed, instead of using a license, they've created
           | massive confusion. At first they told us that the benchmark
           | was incredibly hard. And they showed reviewers subsets that
           | were hard. Now, they're telling us that actually, 25% of the
           | questions are easy. And 50% of the questions are pretty hard.
           | But only a small fraction are what the reviewers saw.
           | 
           | Closed datasets aren't the answer. They're just unscientific
           | nonsense. I refuse to even consider running on them.
           | 
           | We need test sets that are open for scrutiny. With licenses
           | that prevent abuse. We can be very creative about the
           | license. Like, you can only evaluate on this dataset once,
           | and must preregister your evaluations.
        
             | upghost wrote:
             | I would like to agree with you, but I doubt the honor
             | system will work here. We are talking about companies that
             | have blatantly trampled (or are willing to risk a judicial
             | confrontation about trampling) copyright. It would be
             | unreasonable to assume they would not engage in the same
             | behavior about benchmarks and test sets, especially with
             | the amount of money on the line for the winners.
        
       | artine wrote:
       | I'm not closely familiar with this benchmark, but data leakage in
       | machine learning can be way too easy to accidentally introduce
       | even under the best of intentions. It really does require
       | diligence at every stage of experiment and model design to
       | strictly firewall all test data from any and all training
       | influence. So, not surprising when leakage breaks highly
       | publicized benchmarks.
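        | 
        | Even a crude check helps, e.g. flagging test examples whose
        | n-grams overlap heavily with the training corpus. A rough
        | sketch (not a real decontamination pipeline):
        | 
        |   # Flag test examples whose n-grams show up in the training
        |   # corpus; real decontamination (normalization, fuzzy and
        |   # near-duplicate matching) is much more involved.
        |   def ngrams(text, n=8):
        |       toks = text.lower().split()
        |       grams = set()
        |       for i in range(len(toks) - n + 1):
        |           grams.add(tuple(toks[i:i + n]))
        |       return grams
        | 
        |   def contaminated(test_set, train_corpus, n=8, cutoff=0.5):
        |       train_grams = set()
        |       for doc in train_corpus:
        |           train_grams |= ngrams(doc, n)
        |       flagged = []
        |       for ex in test_set:
        |           grams = ngrams(ex, n)
        |           if not grams:
        |               continue
        |           overlap = len(grams & train_grams) / len(grams)
        |           if overlap > cutoff:
        |               flagged.append(ex)  # likely leaked
        |       return flagged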
        
         | mcphage wrote:
         | > data leakage in machine learning can be way too easy to
         | accidentally introduce even under the best of intentions
         | 
         | And lots of people in this space definitely don't have the best
         | of intentions.
        
       | minimaxir wrote:
        | The MTEB benchmark was never that great, since embeddings are
        | used for narrower, domain-specific tasks (e.g.
        | search/clustering) that can't really be represented well in a
        | generalized test, even more so than with LLM
        | next-token-prediction benchmarks, which aren't great either.
       | 
       | As with all LLM models and their subproducts, the _only_ way to
       | ensure good results is to test yourself, ideally with less
       | subjective, real-world feedback metrics.
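        | 
        | Even a tiny in-house eval goes a long way, e.g. recall@k over
        | your own (query, relevant doc) pairs. In this sketch `embed`
        | is a placeholder for whichever embedding model you're testing:
        | 
        |   import numpy as np
        | 
        |   # Recall@k over your own labeled (query, relevant doc id)
        |   # pairs, using cosine similarity for the ranking.
        |   def recall_at_k(embed, pairs, docs, doc_ids, k=10):
        |       vecs = np.asarray([embed(d) for d in docs], dtype=float)
        |       vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
        |       hits = 0
        |       for query, rel_id in pairs:
        |           q = np.asarray(embed(query), dtype=float)
        |           q /= np.linalg.norm(q)
        |           top = np.argsort(vecs @ q)[::-1][:k]
        |           hits += rel_id in {doc_ids[i] for i in top}
        |       return hits / len(pairs)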
        
         | fzliu wrote:
         | > As with all LLM models and their subproducts, the only way to
         | ensure good results is to test yourself, ideally with less
         | subjective, real-world feedback metrics.
         | 
         | This is excellent advice. Sadly, very few people/organizations
         | implement their own evaluation suites.
         | 
         | It doesn't make much sense to put data infrastructure in
         | production without first evaluating its performance (IOPS,
         | uptime, scalability, etc.) on internal workloads; it is no
         | different for embedding models or models in general for that
         | matter.
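          | 
          | Even something as simple as profiling latency on a sample of
          | your own workload catches a lot; a sketch, where `embed` and
          | `sample_texts` stand in for your model and data:
          | 
          |   import statistics
          |   import time
          | 
          |   # Rough latency/throughput profile on your own workload.
          |   def latency_profile(embed, sample_texts):
          |       times = []
          |       for text in sample_texts:
          |           start = time.perf_counter()
          |           embed(text)
          |           times.append(time.perf_counter() - start)
          |       times.sort()
          |       p95 = times[int(0.95 * (len(times) - 1))]
          |       return {
          |           "p50_ms": 1000 * statistics.median(times),
          |           "p95_ms": 1000 * p95,
          |           "throughput_qps": len(times) / sum(times),
          |       }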
        
       | cuuupid wrote:
       | It has been for a while, we ended up building our own test set to
       | evaluate embedding models on our domain.
       | 
       | What we realized after doing this is that MTEB has always been a
       | poor indicator, as embedding model performance varies wildly in-
       | domain compared to out-of-domain. You'll get decent performance
        | (let's say 70%) with most models, but eking out gains over that
       | is domain-dependent more than it is model-dependent.
       | 
        | Personally I recommend NV-Embed because it's easy to deploy
        | and it scores well on the other performance measurements
        | (e.g. speed). You can then simply enrich your data by using
       | an LLM to create standardized artifacts that point back to the
       | original text, kind of like an "embedding symlink."
       | 
        | Our observation has generally been that, after standardizing
        | the data, the best-n models mostly perform the same.
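        | 
        | The "embedding symlink" trick looks roughly like this; the
        | prompt and the `llm`/`embed` calls are placeholders, and the
        | point is that you embed the standardized artifact while
        | keeping a pointer back to the raw text:
        | 
        |   # Embed a standardized artifact per chunk, but keep a
        |   # pointer back to the original text for retrieval.
        |   PROMPT = ("Rewrite the following text as a short, plain "
        |             "summary of its key facts and terms:\n\n{text}")
        | 
        |   def index_chunks(chunks, llm, embed):
        |       index = []
        |       for chunk_id, text in chunks:
        |           artifact = llm(PROMPT.format(text=text))
        |           index.append({
        |               "id": chunk_id,        # the "symlink" target
        |               "original": text,      # what you show the user
        |               "artifact": artifact,  # what you actually embed
        |               "vector": embed(artifact),
        |           })
        |       return index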
        
         | RevEng wrote:
         | Unfortunately it requires commercial licensing. I spoke with
         | them a while ago about pricing and it was awfully expensive for
         | being just one part of a larger product. We have been trying
         | other common open source models and the results have been
         | comparable when using them for retrieval on our domain specific
         | data.
        
       | RevEng wrote:
       | I feel this is common throughout all of training, even on public
       | data. Every time we talk about something specific at length, that
       | becomes part of the training data and that influences the models.
        | For example, pose a problem about a butterfly flapping its
        | wings causing a tornado and all modern LLMs immediately
        | recognize the classic example of chaos theory, but change the
        | entities and suddenly they're not so smart. Same thing for
        | the current fixation on the number of Rs in strawberry.
       | 
        | There was recently a post showing how an LLM could actively
        | try to deceive the user to hide its conflicting alignment,
        | and a chain-of-thought-style prompt showed that it did this
        | very deliberately. However, the thought process it produced
        | and the wording sounded exactly like every example of this
        | theoretical alignment problem. Given that an LLM chooses the
        | most probable tokens based on what it has seen in training,
        | could it be that we unintentionally trained it to respond
        | this way?
        
       ___________________________________________________________________
       (page generated 2024-12-24 23:00 UTC)