[HN Gopher] The MTEB benchmark is dead
___________________________________________________________________
The MTEB benchmark is dead
Author : herecomethefuzz
Score : 31 points
Date : 2024-12-24 19:44 UTC (3 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| 0xab wrote:
| Datasets need to stop shipping with any training sets at all! And
| their licenses should forbid anyone from using the test set to
| update the parameters of any model.
|
| We did this with ObjectNet (https://objectnet.dev/) years ago.
| It's only a test set, no training set provided at all. Back then
| it was very controversial and we were given a hard time for it
| initially. Now it's more accepted. Time to make this idea
| mainstream.
|
| No more training sets. Everything should be out of domain.
| upghost wrote:
| I don't know how this is possible with LLM tests. The
| closed-source models will get access to at least the questions
| when you send them over the fence via the API.
|
| This gives closed-source models an enormous advantage over
| open-source models.
|
| The FrontierMath dataset has this same problem[1].
|
| It's a shame because creating these benchmarks is time-consuming
| and expensive.
|
| I don't know of a way to fix this, except perhaps partially by
| using reward models to evaluate results on randomly generated
| questions instead of fixed datasets, though that would bring a
| lot of reproducibility problems of its own.
|
| Still -- not sure how to overcome this.
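|
| Roughly what that could look like, as a sketch (the generator is
| concrete, but candidate() and reward_model() are placeholders
| for whatever model and judge you'd actually use; publishing the
| seed and the judge is what partially buys back reproducibility):
|
|     # Evaluate on freshly generated questions scored by a reward
|     # model, instead of a fixed test set. candidate() and
|     # reward_model() are stand-ins for the model under test and
|     # the judge.
|     import random
|
|     def generate_question(rng: random.Random) -> str:
|         a, b = rng.randint(100, 999), rng.randint(100, 999)
|         return f"What is {a} * {b}? Show your reasoning."
|
|     def candidate(prompt: str) -> str:
|         raise NotImplementedError("model under test")
|
|     def reward_model(prompt: str, answer: str) -> float:
|         raise NotImplementedError("judge / reward model")
|
|     rng = random.Random(20241224)  # published seed
|     questions = [generate_question(rng) for _ in range(100)]
|     scores = [reward_model(q, candidate(q)) for q in questions]
|     print("mean reward:", sum(scores) / len(scores))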
|
| [1]: https://news.ycombinator.com/item?id=42494217
| light_hue_1 wrote:
| It's possible.
|
| I'm not worried about cheaters. We just need to lay out clear
| rules. You cannot look at the inputs or outputs in any way.
| You cannot log them. You cannot record them for future use,
| either manually or in an automated way.
|
| If someone cheats, they will be found out. Their contribution
| won't stand the test of time, and no one will replicate those
| results with their method. And their performance on the datasets
| they cheated on will be astronomical compared to everything
| else.
|
| FrontierMath is a great example of a failure in this space.
| By going closed, instead of using a license, they've created
| massive confusion. At first they told us that the benchmark
| was incredibly hard. And they showed reviewers subsets that
| were hard. Now, they're telling us that actually, 25% of the
| questions are easy. And 50% of the questions are pretty hard.
| But only a small fraction are what the reviewers saw.
|
| Closed datasets aren't the answer. They're just unscientific
| nonsense. I refuse to even consider running on them.
|
| We need test sets that are open for scrutiny. With licenses
| that prevent abuse. We can be very creative about the
| license. Like, you can only evaluate on this dataset once,
| and must preregister your evaluations.
| upghost wrote:
| I would like to agree with you, but I doubt the honor
| system will work here. We are talking about companies that
| have blatantly trampled (or are willing to risk a judicial
| confrontation about trampling) copyright. It would be
| unreasonable to assume they would not engage in the same
| behavior with benchmarks and test sets, especially with
| the amount of money on the line for the winners.
| artine wrote:
| I'm not closely familiar with this benchmark, but data leakage in
| machine learning can be way too easy to accidentally introduce
| even under the best of intentions. It really does require
| diligence at every stage of experiment and model design to
| strictly firewall all test data from any and all training
| influence. So, not surprising when leakage breaks highly
| publicized benchmarks.
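|
| A rough sketch of the kind of contamination check that helps
| (file names and the JSONL format are hypothetical, and n-gram
| overlap is only a crude proxy; real leakage can be much subtler
| than verbatim reuse):
|
|     # Flag test examples that share a normalized 13-gram with
|     # the training set. Paraphrases and shared upstream sources
|     # will still slip through.
|     import json, re
|
|     def ngrams(text, n=13):
|         tokens = re.findall(r"\w+", text.lower())
|         return {" ".join(tokens[i:i + n])
|                 for i in range(len(tokens) - n + 1)}
|
|     def load_texts(path):
|         with open(path) as f:
|             return [json.loads(line)["text"] for line in f]
|
|     train_grams = set()
|     for doc in load_texts("train.jsonl"):
|         train_grams |= ngrams(doc)
|
|     flagged = [i for i, doc in enumerate(load_texts("test.jsonl"))
|                if ngrams(doc) & train_grams]
|     print(len(flagged), "test examples overlap the training set")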
| mcphage wrote:
| > data leakage in machine learning can be way too easy to
| accidentally introduce even under the best of intentions
|
| And lots of people in this space definitely don't have the best
| of intentions.
| minimaxir wrote:
| The MTEB benchmark was never that great, since embeddings are
| used for narrow, domain-specific tasks (e.g. search/clustering)
| that can't really be represented well in a generalized test, even
| more so than LLM next-token-prediction benchmarks, which aren't
| great either.
|
| As with all LLM models and their subproducts, the _only_ way to
| ensure good results is to test yourself, ideally with less
| subjective, real-world feedback metrics.
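|
| A bare-bones version of "test yourself" for retrieval, sketched
| with a placeholder model and made-up labeled pairs (the point is
| to score on your own queries and corpus, not on MTEB's):
|
|     # Recall@1 on your own (query -> relevant doc) pairs.
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     pairs = [  # (query, id of the doc that should rank first)
|         ("reset 2fa for a locked account", "kb_017"),
|         ("invoice shows a duplicate charge", "kb_042"),
|     ]
|     corpus = {"kb_017": "How to reset two-factor auth ...",
|               "kb_042": "Resolving duplicate charges ...",
|               "kb_099": "Unrelated article on shipping ..."}
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")  # or yours
|     doc_ids = list(corpus)
|     doc_vecs = model.encode([corpus[i] for i in doc_ids],
|                             normalize_embeddings=True)
|
|     hits = 0
|     for query, gold in pairs:
|         q = model.encode([query], normalize_embeddings=True)[0]
|         hits += doc_ids[int(np.argmax(doc_vecs @ q))] == gold
|     print(f"recall@1 = {hits / len(pairs):.2f}")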
| fzliu wrote:
| > As with all LLM models and their subproducts, the only way to
| ensure good results is to test yourself, ideally with less
| subjective, real-world feedback metrics.
|
| This is excellent advice. Sadly, very few people/organizations
| implement their own evaluation suites.
|
| It doesn't make much sense to put data infrastructure in
| production without first evaluating its performance (IOPS,
| uptime, scalability, etc.) on internal workloads; it is no
| different for embedding models or models in general for that
| matter.
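|
| Even a tiny harness over a sample of real inputs goes a long
| way; here is a sketch where embed_batch() stands in for whatever
| embedding endpoint is actually being considered:
|
|     # Latency/throughput probe on a representative workload.
|     import statistics, time
|
|     def embed_batch(texts):
|         raise NotImplementedError("your embedding client here")
|
|     workload = ["<representative input text>"] * 64
|     latencies = []
|     for _ in range(20):
|         t0 = time.perf_counter()
|         embed_batch(workload)
|         latencies.append(time.perf_counter() - t0)
|
|     p50 = statistics.median(latencies)
|     print(f"p50 {p50 * 1000:.1f} ms per batch of 64, "
|           f"~{len(workload) / p50:.0f} docs/s")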
| cuuupid wrote:
| It has been for a while; we ended up building our own test set to
| evaluate embedding models on our domain.
|
| What we realized after doing this is that MTEB has always been a
| poor indicator, as embedding model performance varies wildly in-
| domain compared to out-of-domain. You'll get decent performance
| (let's say 70%) with most models, but eking out gains over that
| is domain-dependent more than it is model-dependent.
|
| Personally I recommend NV-Embed because it's easy to deploy and
| to get the other performance measures (e.g. speed) up to a high
| spec. You can then enrich the data itself, e.g. by using an LLM
| to create standardized artifacts that point back to the original
| text, kind of like an "embedding symlink."
|
| Our observation has largely been that after standardizing the
| data, the best-n models mostly perform the same.
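|
| Roughly what the "symlink" pattern looks like in code, with
| standardize() and embed() as placeholders for whatever LLM and
| embedding model are actually in use:
|
|     # Embed an LLM-standardized artifact, but keep a pointer
|     # back to the raw source text. At query time you search over
|     # the standardized vectors, then follow metadata["source_id"]
|     # to fetch the original document for context.
|     from dataclasses import dataclass
|
|     @dataclass
|     class Artifact:
|         source_id: str     # pointer back to the source doc
|         standardized: str  # standardized LLM restatement
|
|     def standardize(raw_text: str) -> str:
|         # e.g. prompt an LLM for "Topic / Entities / Claims"
|         raise NotImplementedError("call your LLM of choice here")
|
|     def make_record(doc_id: str, raw_text: str, embed) -> dict:
|         art = Artifact(doc_id, standardize(raw_text))
|         return {"id": doc_id,
|                 "vector": embed(art.standardized),
|                 "metadata": {"source_id": art.source_id}}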
| RevEng wrote:
| Unfortunately it requires commercial licensing. I spoke with
| them a while ago about pricing and it was awfully expensive for
| being just one part of a larger product. We have been trying
| other common open source models and the results have been
| comparable when using them for retrieval on our domain specific
| data.
| RevEng wrote:
| I feel this is common throughout all of training, even on public
| data. Every time we talk about something specific at length, that
| becomes part of the training data and that influences the models.
| For example, pose a problem about a butterfly flapping its wings
| causing a tornado and all modern LLMs immediately recognize the
| classic chaos-theory example, but change the entities and
| suddenly they're not so smart. The same goes for the current
| fixation on the number of Rs in strawberry.
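|
| A cheap way to probe for that kind of memorization is to hold
| the structure fixed and swap the entities; the prompts below are
| just illustrative, and ask_llm() is a stand-in for whatever
| client you use:
|
|     # Entity-swap probe: same problem structure, different
|     # surface form. Compare whether answer quality survives.
|     CANONICAL = ("A butterfly flaps its wings in Brazil. Could "
|                  "that set off a tornado in Texas? Answer briefly.")
|     SWAPPED = ("A dragonfly beats its wings over Lake Geneva. "
|                "Could that set off a hailstorm in Osaka? Answer "
|                "briefly.")
|
|     def ask_llm(prompt: str) -> str:
|         raise NotImplementedError("call your model here")
|
|     for label, prompt in [("canonical", CANONICAL),
|                           ("swapped", SWAPPED)]:
|         print(label, "->", ask_llm(prompt))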
|
| There was recently a post showing how an LLM could actively try
| to deceive the user to hide its conflicting alignment, and a
| chain-of-thought-style prompt showed how it did this very
| deliberately. However, the thought process and wording it
| produced sounded exactly like every example of this theoretical
| alignment problem. Given that an LLM chooses the most probable
| tokens based on what it has seen in training, could it be that we
| unintentionally trained it to respond this way?
___________________________________________________________________
(page generated 2024-12-24 23:00 UTC)