[HN Gopher] A cartel of influential datasets are dominating mach...
       ___________________________________________________________________
        
       A cartel of influential datasets are dominating machine learning
       research
        
       Author : Hard_Space
       Score  : 227 points
       Date   : 2021-12-06 10:46 UTC (12 hours ago)
        
 (HTM) web link (www.unite.ai)
 (TXT) w3m dump (www.unite.ai)
        
       | omarhaneef wrote:
       | A little confused:
       | 
       | Are these big "dominant" institutions charging for this data? No,
       | the spend a lot of resources putting them together and give them
       | away free.
       | 
       | Are they preventing others from giving away data? No, but it
       | costs a lot and they bear that cost.
       | 
        | Are they forcing smaller institutions to use their data? No. It's
        | just a free resource they offer.
       | 
       | Do they get the grants themselves because they have some kind of
       | proprietary access? No, the whole point is that the benchmark is
       | open and everyone has access.
       | 
        | So they collect this data, vet it, propose its use for benchmarks,
        | and give it away free. What is the complaint? The
       | problem is not even posed as "well, this data is overfitted in
       | papers or we are solving narrower problems". [Edit: they sort of
       | do, but leaving this here so the comment below makes sense as a
       | follow up.]
       | 
        | No, the complaint is that this handful of institutions is giving
        | away free data, so too many people use it? Can we have more
        | "problems" of this nature?
        
         | michaelt wrote:
         | _> So they collect this data, vet it, propose their use for
         | benchmarks and give it away free. What is the complaint?_
         | 
         | It's well known that neural networks can easily inherit biases
         | from their training data. It's also well known that datasets
         | generated by western universities are widely used in training
         | and evaluating neural networks.
         | 
         | If my training set is full of pictures of Stanford CS
         | undergraduates, I could end up with a computational photography
         | system that makes everyone look like Stanford CS
         | undergraduates, or a historical photo colourisation system that
         | makes everyone look like Stanford CS undergraduates, or a self
         | driving car pedestrian tracking system that expects 90% of
         | pedestrians to look like Stanford CS undergraduates.
         | 
         | And if the people who make the model say "Hey, our model's
         | biases aren't our responsibility, we're just representing the
         | training data as best we can" and the people who make the
         | training data say "Hey, we never claimed it was perfect, you
         | can take it or leave it" these problems might fall through the
         | cracks.
        
           | 908B64B197 wrote:
           | > It's well known that neural networks can easily inherit
           | biases from their training data. It's also well known that
           | datasets generated by western universities are widely used in
           | training and evaluating neural networks. If my training set
           | is full of pictures of Stanford CS undergraduates, I could
           | end up with a computational photography system that makes
           | everyone look like Stanford CS undergraduates, or a
           | historical photo colourisation system that makes everyone
           | look like Stanford CS undergraduates, or a self driving car
           | pedestrian tracking system that expects 90% of pedestrians to
           | look like Stanford CS undergraduates.
           | 
           | Why then aren't foreign universities/companies simply...
           | building their own datasets?
        
           | mattnewton wrote:
           | > And if the people who make the model say "Hey, our model's
           | biases aren't our responsibility, we're just representing the
           | training data as best we can"
           | 
           | I don't think this excuse is like the others. If the model
            | doesn't work well because they used biased data, it is the job
           | of the people making the model to find better data (or
           | manipulate the training process to overweight some data and
           | attempt to counteract the bias).
           | 
           | I think the burden of responsibility has to be on the people
           | who make models or put models into products to make sure the
           | model is a good fit for the problem it is solving.
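            | 
            | As a minimal, purely illustrative sketch of that kind of
            | counteracting reweighting (assuming a scikit-learn-style
            | classifier that accepts sample_weight; the group attribute and
            | data below are made up):
            | 
            |   import numpy as np
            |   from sklearn.linear_model import LogisticRegression
            | 
            |   # toy data: one "group" is heavily over-represented
            |   rng = np.random.default_rng(0)
            |   X = rng.normal(size=(1000, 5))
            |   y = (X[:, 0] > 0).astype(int)
            |   group = rng.choice([0, 1], size=1000, p=[0.9, 0.1])
            | 
            |   # weight each example by the inverse frequency of its group
            |   # so the minority group contributes as much to the loss
            |   counts = np.bincount(group)
            |   weights = (len(group) / (len(counts) * counts))[group]
            | 
            |   model = LogisticRegression()
            |   model.fit(X, y, sample_weight=weights)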
        
         | lumost wrote:
         | I believe the author is suggesting that large institutions are
         | doing this to earn extra citations - the currency of academia.
         | 
         | A large institution can make a dataset for X then browbeat
         | other researchers into using X and citing X. Using X also
         | likely leads to citations of derivative work by the lead
         | institution.
        
           | omarhaneef wrote:
           | Is the term "browbeat" fair here? I don't think they're
           | making calls and saying "Oh, nice paper, but I notice you
           | used this other dataset..." No, they're putting out good
           | quality datasets that people want to use. If that earns them
            | a citation, good for them. It's the least I can do for helping
           | me test my algo.
        
             | lumost wrote:
              | It depends; feedback is rarely as polite as "I noticed you
              | used this other dataset". The feedback would probably look
              | like:
              | 
              | - "Nice paper, however the results are not relevant to
              | current research due to the use of the X dataset rather than
              | the Y or Z datasets. Score 2/5, do not accept."
              | 
              | - "Nice paper, however the results are of unknown quality
              | due to the use of the X dataset. Score 3/5, recommend poster
              | track."
             | 
              | In fact I'd generally say that most paper reviews would
              | drop the first three words of that feedback. It's not an
             | unreasonable assertion that progress is measured on
             | standard datasets - but it's also necessary to push back on
             | this.
        
               | durovo wrote:
               | If a non-standard dataset is being used, I would expect
               | there to be a discussion/analysis on what characteristics
               | of that dataset made it unusable for this paper.
               | Especially if a proposed model is being compared against
               | models that were trained on those standard datasets.
               | 
               | If you are establishing new baselines using those same
               | models on your non-standard dataset, then one would
               | expect you to put in a good amount of effort to finetune
                | all the knobs to get a reasonable result. If the authors
                | are able to put in that much effort, then that kind of
                | feedback is definitely unreasonable.
        
               | YeGoblynQueenne wrote:
               | >> If a non-standard dataset is being used, I would
               | expect there to be a discussion/analysis on what
               | characteristics of that dataset made it unusable for this
               | paper.
               | 
               | Unfortunately that just adds more work for the reviewer,
               | which is a motive for many reviewers to scrap the paper
               | so they don't have to do the extra work.
               | 
               | That sounds mean, so I will quote (yet again) Geoff
               | Hinton on things that "make the brain hurt":
               | 
               |  _GH: One big challenge the community faces is that if
               | you want to get a paper published in machine learning now
                | it's got to have a table in it, with all these different
               | data sets across the top, and all these different methods
               | along the side, and your method has to look like the best
               | one. If it doesn't look like that, it's hard to get
               | published. I don't think that's encouraging people to
               | think about radically new ideas._
               | 
               |  _Now if you send in a paper that has a radically new
                | idea, there's no chance in hell it will get accepted,
               | because it's going to get some junior reviewer who
               | doesn't understand it. Or it's going to get a senior
               | reviewer who's trying to review too many papers and
               | doesn't understand it first time round and assumes it
               | must be nonsense. Anything that makes the brain hurt is
               | not going to get accepted. And I think that's really
               | bad._
               | 
               | https://www.wired.com/story/googles-ai-guru-computers-
               | think-...
               | 
               | Basically a new dataset is like a new idea: it makes the
               | brain hurt, for the overburdened experienced researcher
               | or inexperienced younger researcher alike. Testing a new
               | approach on a new dataset? That makes brain go boom.
               | 
               | Which is a funny state of affairs. Not so long ago it
               | used to be that one sure-fire way to make a significant
               | contribution that would give your paper a leg up over the
               | competition was to create a new dataset. I was advised as
                | much at the start of my PhD (four-ish years ago). Seems
               | like this has already changed.
        
               | durovo wrote:
               | You might be right here. My comment was more of my
               | expectation as a reader on what should be present in such
               | a paper.
        
               | blt wrote:
               | Absolutely. Failure to report results on a popular
               | benchmark suggests to some reviewers that you have
               | something to hide - even though they might be
               | computationally expensive or tangential to the main point
               | of the work.
        
         | Levitz wrote:
         | >The problem is not even posed as "well, this data is
         | overfitted in papers or we are solving narrower problems".
         | 
         | I mean it does get into that:
         | 
         | "They additionally note that blind adherence to this small
         | number of 'gold' datasets encourages researchers to achieve
         | results that are overfitted (i.e. that are dataset-specific and
         | not likely to perform anywhere near as well on real-world data,
         | on new academic or original datasets, or even necessarily on
         | different datasets in the 'gold standard')."
        
           | omarhaneef wrote:
           | True, I guess my complaint is that they don't get into the
           | mechanisms.
           | 
            | The sibling comment on the benchmark lottery paper lays this
            | out, but I should amend my original comment.
        
             | [deleted]
        
       | sabhiram wrote:
       | If there was a dataset that solved a problem sufficiently
       | (perception, self driving, whatever), there would be no more need
       | for any study of that field (other than for marginal
       | improvements).
       | 
        | Once solved, these are no longer research areas. Getting score X
        | on validation set Y means you get your software driver's license.
        
       | Derbasti wrote:
       | This is what my dissertation was about, in a different field:
       | Everyone using the same datasets, even though they are severely
       | flawed.
       | 
        | As it turned out, differences between datasets proved far larger
        | than differences between algorithms. And the most popular datasets
        | in fact included biases
       | and eccentricities that were bound to cause such problems.
       | 
       | https://bastibe.github.io/Dissertation-Website/ (figures
       | 12.9-12.11, if you're interested)
        
         | fock wrote:
          | Thank you! I also found the same thing during my PhD, and it
          | took a basic pay-to-publish paper and 2 years to get my
          | supervisor to agree that this was not really going anywhere with
          | publications and to switch to a field where I am more
          | comfortable producing publications. Essentially my field has 1
          | big dataset, and then domain experts "improve" models by
          | creating their own data (200 samples) and claim a "novel" method
          | (which is not transferable at all).
          | 
          | The pay-to-publish paper was sent back by 2 journals with
          | reviews amounting to "our method is better, just use the right
          | (our) thing" (all of which I refuted for the professor by
          | actually doing it, but the editors wouldn't hear of it...). And
          | then there are papers in these journals predicting features
          | through a complex preprocessing pipeline. Academia and big
          | companies are just idiotic.
        
           | Derbasti wrote:
           | Funny that you would mention trying to publish such findings
           | in journals. I tried... more than two times, as well.
           | Rejected with the most spurious of claims, despite my
           | presenting evidence disproving them. Frankly, these journal
           | submissions were _the_ most frustrating and trying
           | experiences in my professional life.
           | 
           | Oh well. I'm glad I got my degree, and could leave academia
           | relatively unharmed.
           | 
           | Now I work in AI (engineering!), where most science is
            | high-school-level "hey, I tried stuff, and things happened.
           | Dunno why, though". It's just ridiculous.
        
       | zibzab wrote:
       | Somewhat related to this: What methods do people normally use to
       | measure the quality of a dataset?
       | 
       | For example, if there are 50 datasets of historical weather data
       | how can I determine which one is garbage?
        
         | dunefox wrote:
         | I would say it's garbage if there's a paper like this:
         | https://arxiv.org/abs/1902.01007
        
           | visarga wrote:
           | That's like hard negative mining. You can also train a weak
           | model and filter out the examples it manages to predict
           | correctly to come up with a harder dataset.
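            | 
            | A rough sketch of that filtering idea (illustrative only; the
            | toy data and the choice of weak baseline are placeholders):
            | 
            |   import numpy as np
            |   from sklearn.model_selection import train_test_split
            |   from sklearn.linear_model import LogisticRegression
            | 
            |   # toy labelled dataset standing in for the real one
            |   rng = np.random.default_rng(0)
            |   X = rng.normal(size=(2000, 10))
            |   y = (X[:, :2].sum(axis=1) + rng.normal(size=2000) > 0)
            |   y = y.astype(int)
            | 
            |   # fit a deliberately weak baseline on half of the data
            |   X_fit, X_pool, y_fit, y_pool = train_test_split(
            |       X, y, test_size=0.5, random_state=0)
            |   weak = LogisticRegression(max_iter=200).fit(X_fit, y_fit)
            | 
            |   # keep only the examples the weak model gets wrong,
            |   # yielding a "harder" subset of the original dataset
            |   wrong = weak.predict(X_pool) != y_pool
            |   X_hard, y_hard = X_pool[wrong], y_pool[wrong]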
        
         | [deleted]
        
       | YeGoblynQueenne wrote:
       | >> According to the paper, Computer Vision research is notably
       | more affected by the syndrome it outlines than other sectors,
       | with the authors noting that Natural Language Processing (NLP)
       | research is far less affected. The authors suggest that this
       | could be because NLP communities are 'more coherent' and larger
       | in size, and because NLP datasets are more accessible and easier
       | to curate, as well as being smaller and less resource-intensive
       | in terms of data-gathering.
       | 
       | I'll have to read the paper to see what exactly it says on this
       | but my knowledge of NLP benchmark datasets is exactly the
       | opposite: the majority are simply no use for measuring the
       | capabilities they're supposed to be measuring. For example,
       | natural language _understanding_ datasets are typically created
       | as multiple-choice questionnaires, so that, a) there is already a
       | baseline accuracy that a system can "hit" just by chance (but
       | which is almost never noted, or compared against, in papers) and
       | b) a system that's good at classification can beat the benchmarks
       | black and blue without doing any "understanding". And this is,
       | indeed, what's been going on for quite a while now with large
       | language models that take all the trophies and are still dumb as
       | bricks.
       | 
        | To make matters worse, NLP also doesn't have any good _metrics_
       | of performance. Stuff like the BLEU scores are just laughably
       | inadequate. NLP is all bad metrics over bad benchmarks. _And_ NLP
        | results are much harder to just "eyeball" than machine vision
       | results (and the models are much harder to interpret than machine
       | vision models where you can at least visualise the activations
       | and see... something). I think NLP is much, much worse than
       | machine vision.
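        | 
        | For the multiple-choice point above, the chance baseline that
        | papers rarely report is trivial to compute (illustrative numbers):
        | 
        |   from collections import Counter
        | 
        |   def baselines(gold_labels, k):
        |       """Chance and majority-class accuracy for a k-way
        |       multiple-choice benchmark; gold_labels holds the index of
        |       the correct option for each question."""
        |       chance = 1.0 / k
        |       n = len(gold_labels)
        |       majority = Counter(gold_labels).most_common(1)[0][1] / n
        |       return chance, majority
        | 
        |   # e.g. a 4-way benchmark where option 0 is correct 40% of the time
        |   gold = [0] * 40 + [1] * 20 + [2] * 20 + [3] * 20
        |   print(baselines(gold, k=4))  # (0.25, 0.4)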
        
         | xiaolingxiao wrote:
          | Your suspicion is correct. I worked on such a dataset paper and
          | tried to "beat" other methods on well-accepted benchmarks with
          | dubious accuracy scores. One fundamental issue is that outside
          | of POS tagging, there isn't a notion of empirical truth to
          | measure against, only a small sample of what a "normal person
          | would think." This stands in contrast to computer vision, where
          | in a task such as monocular depth estimation from a single
          | frame, you can always measure against a Lidar-scanned depth map.
          | Systems can still overfit on the benchmark, and they do, but at
          | least the ground truth itself is not in dispute. A question such
          | as whether this is the "appropriate" response to a query is too
          | open to interpretation.
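          | 
          | A rough illustration of that kind of ground-truth comparison
          | (not from the original comment; real depth benchmarks use
          | several error metrics, this is just an RMSE over valid lidar
          | pixels):
          | 
          |   import numpy as np
          | 
          |   def depth_rmse(pred, lidar):
          |       """RMSE between a predicted depth map and a sparse lidar
          |       depth map (metres); lidar pixels with no return are 0."""
          |       mask = lidar > 0        # score only pixels with a return
          |       err = pred[mask] - lidar[mask]
          |       return float(np.sqrt(np.mean(err ** 2)))
          | 
          |   # toy example: constant prediction vs. a sparse lidar scan
          |   pred = np.full((4, 4), 10.0)
          |   lidar = np.zeros((4, 4))
          |   lidar[::2, ::2] = 9.5
          |   print(depth_rmse(pred, lidar))  # 0.5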
        
       | hamasho wrote:
        | I think it's good that well-funded institutions publish well-
        | crafted datasets for everyone else. It helps small teams develop
        | their models.
        | 
        | The problem here is, though, that most publishers don't accept
        | papers if the new proposal isn't backed by benchmarks on the
        | well-known datasets. Even if it is a competitive approach for a
        | specific field, they reject the paper anyway if it doesn't perform
        | well on those datasets.
        
         | throwawayay02 wrote:
         | Do they really reject them though? In my experience publishers
          | will publish anything; if anything, they should have higher
          | standards, not lower.
        
           | domenicrosati wrote:
           | For brand name journals and conferences, yes they do. My
           | experience with those is a very high rejection rate.
           | 
           | There is a good reason to reject if you are not using a
           | standard dataset. How can you compare the results of two
            | approaches to, say, natural language inference (whether one
            | sentence entails another) without results being tested on the
           | same dataset?
           | 
           | I think one thing overlooked in the conversation is that many
           | papers start with a standard baseline and then use another
           | dataset to establish additional results.
           | 
            | In my experience in NLP, journals and conferences also tend
            | to establish datasets of their own when they issue a call for
            | submissions. Often these are called the "shared task" track.
           | ACL has operated this way for decades.
        
         | irjustin wrote:
          | It's a pattern matching problem. It takes a lot of work, and
          | there are too many unknowns, when dealing with "new benchmark
          | sets" or ... just any other set.
        
       | F6F6FA wrote:
       | The word "cartel" is too negative, especially when discussing
        | political and social factors. The narrative seems to have been
        | written before the findings. Or at least, it could well have been.
       | 
       | For a computer science analogy: It is a paper with the finding
       | that most successful computer languages are created at
        | prestigious institutes. An obvious (not a bad) finding. It's not
        | as if you could give the motivation, skills, expertise, resources,
        | and time to a small new institute and expect it to come up with a
        | new language that the community will adopt.
       | 
       | Yes, if you write and publish a good data set, and it gets
       | adopted by the community, then you gain lots of citations. This
       | reward is known, and therefore some researchers expend the effort
       | of gathering and curating all this data.
       | 
       | It is not a "vehicle for inequality in science". Benchmarks in ML
        | are a way to create a level playing field for all, and allow
       | one to compare results. Picking a non-standard new benchmark to
       | evaluate your algorithm is bad practice. And benchmarks are the
       | true meritocracy. Beat the benchmark, and you too can publish. No
       | matter the PR or extra resources from big labs. It is test
       | evaluation that counts, and this makes it fair. Other fields may
       | have authorities writing papers without even an evaluation.
       | That's not a good position for a field to be in.
       | 
       | > The prima facie scientific validity granted by SOTA
       | benchmarking is generically confounded with the social
       | credibility researchers obtain by showing they can compete on a
       | widely recognized dataset
       | 
        | Here, the authors presume that researchers' social credibility
        | has any sway. There is no social credibility for a Master's
        | student in Bangladesh, but when they show they can compete, they
        | can join and publish. Wonderful!
       | 
        | Where the authors invoke the long history of train-test splits to
        | argue that the cons have outweighed the benefits, they should
        | reason further and provide more data to actually show this and
        | bring the field along. Ironically, people take more note of this
        | very paper because of the authors' institutional affiliations. I
        | do too. If they had a benchmark, I would have looked at that
        | first.
       | 
       | > Given the observed high concentration of research on a small
       | number of benchmark datasets, we believe diversifying forms of
       | evaluation is especially important to avoid overfitting to
       | existing datasets and misrepresenting progress in the field.
       | 
        | I believe these authors find diversity important. But on
        | overfitting, they should look at actual (meta-)studies and data,
        | which seem to conflict with their claim. For instance:
       | 
       | > A Meta-Analysis of Overfitting in Machine Learning (2019)
       | 
       | > We conduct the first large meta-analysis of overfitting due to
       | test set reuse in the machine learning community. Our analysis is
       | based on over one hundred machine learning competitions hosted on
       | the Kaggle platform over the course of several years. In each
       | competition, numerous practitioners repeatedly evaluated their
       | progress against a holdout set that forms the basis of a public
       | ranking available throughout the competition. Performance on a
       | separate test set used only once determined the final ranking. By
       | systematically comparing the public ranking with the final
       | ranking, we assess how much participants adapted to the holdout
       | set over the course of a competition. Our study shows, somewhat
       | surprisingly, little evidence of substantial overfitting. These
       | findings speak to the robustness of the holdout method across
       | different data domains, loss functions, model classes, and human
       | analysts.
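        | 
        | A sketch of the ranking comparison that abstract describes,
        | correlating each team's public-leaderboard standing with its final
        | private standing (scores below are made up for illustration):
        | 
        |   from scipy.stats import spearmanr
        | 
        |   public_scores  = [0.91, 0.90, 0.89, 0.88, 0.85]  # seen all along
        |   private_scores = [0.88, 0.89, 0.87, 0.86, 0.84]  # used once at end
        | 
        |   rho, _ = spearmanr(public_scores, private_scores)
        |   print(rho)  # close to 1 -> little sign of adaptive overfitting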
        
       | carschno wrote:
       | Too bad they don't cite the paper "The Benchmark Lottery" (M.
        | Dehghani et al., 2021) (https://arxiv.org/abs/2107.07002)
       | 
       | > The world of empirical machine learning (ML) strongly relies on
       | benchmarks in order to determine the relative effectiveness of
       | different algorithms and methods. This paper proposes the notion
       | of "a benchmark lottery" that describes the overall fragility of
       | the ML benchmarking process. The benchmark lottery postulates
       | that many factors, other than fundamental algorithmic
       | superiority, may lead to a method being perceived as superior. On
       | multiple benchmark setups that are prevalent in the ML community,
       | we show that the relative performance of algorithms may be
       | altered significantly simply by choosing different benchmark
       | tasks, highlighting the fragility of the current paradigms and
       | potential fallacious interpretation derived from benchmarking ML
       | methods. Given that every benchmark makes a statement about what
       | it perceives to be important, we argue that this might lead to
       | biased progress in the community. We discuss the implications of
       | the observed phenomena and provide recommendations on mitigating
       | them using multiple machine learning domains and communities as
       | use cases, including natural language processing, computer
       | vision, information retrieval, recommender systems, and
       | reinforcement learning.
       | 
       | Edit: By "they", I was actually referring to the linked article.
       | Strangely, even the paper the article is about does not cite "The
       | Benchmark Lottery" at all.
        
         | stayfrosty420 wrote:
         | Seems like a distinct issue to me, though there are obvious
          | parallels and similar outcomes.
        
           | koheripbal wrote:
           | Because AIs are relatively narrowly focused, they suffer
           | greatly from limited data sets. More often than not, an AI
           | will simply memorize test data rather than "learning".
           | 
           | This impacts the benchmarking process, so I think it's
           | relevant.
        
             | mellavora wrote:
             | Hey, you're giving away my secret for passing AWS
             | certification exams!
        
           | carschno wrote:
           | Agree, I was not saying it was redundant. But I would have
           | expected a reference, as that paper was the first related
           | item that came to my mind.
        
         | sea-shunned wrote:
         | It's also a shame that this paper in turn doesn't cite "Testing
         | Heuristics: We Have It All Wrong" (J. N. Hooker, 1995) (http://
         | citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71....),
         | which discusses these issues in much the same way. It's good to
         | see in "The Benchmark Lottery", however, that they look more
         | into specific tasks and their algorithmic rankings and provide
         | some sound recommendations.
         | 
         | One thing that I'd add (somewhat selfishly as it relates to my
         | PhD work), is the idea of generating datasets that are
         | deliberately challenging for different algorithms. Scale this
         | across a test suite of algorithms, and their relative strengths
         | and weaknesses become clearer. The caveat here is that it
         | requires having a set of measures that quantify different types
         | of problem difficulty, which depending on the task/domain can
         | range from well-defined to near-impossible.
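          | 
          | As a crude, illustrative version of that idea (using class
          | separation in a synthetic generator as the "difficulty" knob and
          | watching how two algorithms' relative strengths shift; the
          | generator settings and classifiers are arbitrary choices):
          | 
          |   from sklearn.datasets import make_classification
          |   from sklearn.model_selection import cross_val_score
          |   from sklearn.naive_bayes import GaussianNB
          |   from sklearn.neighbors import KNeighborsClassifier
          | 
          |   # generate synthetic datasets of increasing difficulty
          |   for sep in (2.0, 1.0, 0.5):
          |       X, y = make_classification(n_samples=500, n_features=10,
          |                                  n_informative=5, class_sep=sep,
          |                                  random_state=0)
          |       for clf in (GaussianNB(), KNeighborsClassifier()):
          |           score = cross_val_score(clf, X, y, cv=5).mean()
          |           name = clf.__class__.__name__
          |           print(f"class_sep={sep}  {name}: {score:.2f}")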
        
           | gopher_space wrote:
           | I've been looking at parsing paragraph structure and have
           | started thinking about a conceptual mechanical turk/e e
           | cummings line in the sand where it's just going to be easier
           | to pay some kid with a cell phone to read words for you. The
           | working implementations I've seen are heavily tied to domain
           | and need to nail down language, which isn't really a thing.
           | 
           | Quantification is fascinating, it seems to be something I
           | take for granted until I actually want to make decisions.
           | It's like I'm constantly trying to forget that analog and
           | digital are two totally separate concepts. I wouldn't really
           | recommend reading Castaneda to anyone but he describes people
           | living comfortably with mutually exclusive ideas in their
           | head walled off by context, and I'd like that sort of
           | understanding.
        
       | Der_Einzige wrote:
        | Well, I recently released an NLP dataset of almost 200K documents
        | and got 4 whole citations in a year. I wish I could find a way to
        | join this cartel and get someone to use it or even care about it.
       | 
       | https://paperswithcode.com/dataset/debatesum
       | 
       | https://huggingface.co/datasets/Hellisotherpeople/DebateSum
       | 
       | https://scholar.google.com/citations?user=uHozRV4AAAAJ&hl=en
        
       | theknocker wrote:
       | So the AI has already taken over its own research.
        
       | sqrt17 wrote:
       | It takes substantial effort to build a good dataset,
       | proportionally more if it gets bigger, and people like big
       | datasets because you can train more powerful models from them. So
       | I am not surprised that people tend to gravitate towards datasets
       | made by well-funded institutions.
       | 
        | The alternative is either a small dataset that people heavily
        | overfit (e.g. the MUC6 corpus, heavily used for coreference at one
        | point, where people cared more about getting high numbers than
        | useful results) or something like the Universal Dependencies
        | corpus, which is provided by a large consortium of smaller
        | institutions.
        
       | lifeisstillgood wrote:
       | So, like humans, training an AI takes lots of curated data
       | (education, journalism, political savvy and avoidance of bias or
       | corruption). It took a very long time to get that corpus ready
       | and available for humanity (say 10,000 years).
       | 
       | Now we have exploded that corpus to include everything anyone
       | says online, and we are worried that the lack of curation means
       | we cannot be sure what the models will come back with.
       | 
        | It's rather like sending our kids out to find an education
       | themselves, and finding three come back with nothing, two spent
       | years learning from the cesspit of extremism, two are drug
       | addicts hitting "more" and one stumbled into a library.
       | 
        | Just a thought, but I don't think journalism is really about
       | writing newspaper articles. It really is about _curating_ the
       | whole wide world and coming back with  "this is what you need to
       | know". Journalism is the curation AI needs...
        
         | visarga wrote:
         | Kids (and GPT) say the darndest things...
        
       | oakfr wrote:
       | This article is completely disconnected from what is happening in
       | the machine learning community.
       | 
       | Granted, datasets have grown larger and larger over time. This
       | concentration is actually a healthy sign of a community maturing
       | and converging towards common benchmarks.
       | 
       | These datasets are open to all for research and have fueled
       | considerable progress both in academia and industry.
       | 
       | The authors would be well-advised to look at what is happening in
       | the community. Excerpts from the program of NeurIPS happening
       | literally this week:
       | 
        | Panels:
        | 
        | - The Consequences of Massive Scaling in Machine Learning
        | 
        | - The Role of Benchmarks in the Scientific Progress of Machine
        | Learning
        | 
        | - How Copyright Shapes Your Datasets and What To Do About It
        | 
        | - How Should a Machine Learning Researcher Think About AI Ethics?
       | 
       | All run by top-notch people coming from both academia and
       | industry, and from a variety of places in the world.
       | 
       | I am not saying that everything is perfect, but this article
       | paints a much darker picture than needed.
        
       | Terry_Roll wrote:
       | Just another example of academic conformity.
        
       | gibsonf1 wrote:
       | I think they are highly overestimating how much science advanced
       | pattern matching will provide. Certainly no conceptual
       | understanding will ever come from that.
        
         | amcoastal wrote:
          | Advanced pattern matching is called science; it is how humans
          | have made progress, by taking data and writing down models for
          | it. Now computers do it. End-to-end algorithms are 'black box',
          | but machine learning in general is much broader than that and is
          | providing understanding in many fields. Sorry you only see/know
          | the 'draw a box around a pedestrian' stuff, but try not to judge
          | entire fields based on limited exposure.
        
         | DannyBee wrote:
         | I think you are highly underestimating how much science and
         | math comes mainly from advanced pattern matching.
         | 
         | Most stuff is proven by using advanced pattern matching as
         | "intuition", and then breaking problems down into things we can
         | pattern match and prove.
         | 
          | I'm not sure what conceptual understanding you think _isn't_
         | pattern matching. They are just rules about patterns and
         | interactions of patterns. These rules were developed through
         | pattern matching.
        
       | dexter89_kp3 wrote:
       | With increasing evidence that self-supervision works for multiple
       | tasks and architectures, and sim2real/simulated data being used
       | in industry I do not see this as an important concern.
        
       | andreyk wrote:
       | Ugh the title of the article.
       | 
       | Here's the paper that this article is recapping - "Reduced,
       | Reused and Recycled: The Life of a Dataset in Machine Learning
       | Research", https://openreview.net/forum?id=zNQBIBKJRkd
       | 
       | Abstract: "Benchmark datasets play a central role in the
       | organization of machine learning research. They coordinate
       | researchers around shared research problems and serve as a
       | measure of progress towards shared goals. Despite the
       | foundational role of benchmarking practices in this field,
       | relatively little attention has been paid to the dynamics of
       | benchmark dataset use and reuse, within or across machine
       | learning subcommunities. In this paper, we dig into these
       | dynamics. We study how dataset usage patterns differ across
       | machine learning subcommunities and across time from 2015-2020.
       | We find increasing concentration on fewer and fewer datasets
       | within task communities, significant adoption of datasets from
       | other tasks, and concentration across the field on datasets that
       | have been introduced by researchers situated within a small
       | number of elite institutions. Our results have implications for
       | scientific evaluation, AI ethics, and equity/access within the
       | field."
       | 
       | The reviews seem quite positive, if short. On a skim it looks
       | very solid, offering an empirical look at the dynamics of
       | benchmark usage that IMO seems unprecedented, so I'm not
       | surprised it got positive reviews and accepted.
        
         | oakfr wrote:
         | FWIW, this paper is being presented at NeurIPS this week.
        
       | whimsicalism wrote:
       | > Among their findings - based on core data from the Facebook-led
       | community project Papers With Code (PWC)
       | 
       | Oh boy. PWC is not even close to a representative sample of what
       | datasets are being used in papers. It's also often out of date.
        
         | ansgri wrote:
          | When you cannot analyze every paper ever published, you need
          | some relevant criterion for inclusion -- and PWC is somewhat of
          | a GitHub of AI science (at least it was in its early days).
          | There may be multiple collections, but PWC is by far the most
          | accessible and thus reasonably captures current and emerging
          | trends in some dataset-heavy research fields.
        
       | ckastner wrote:
       | Also interesting in this context: this recent paper [1] analyzed
       | some of the most popular datasets and found numerous label errors
       | in the _test sets_. For example, they estimated at least 6% of
       | the samples in the ImageNet validation set were misclassified,
       | with surprising consequences when comparing model performances
       | using corrected sets.
       | 
       | [1] _Pervasive Label Errors in Test Sets Destabilize Machine
       | Learning Benchmarks_ , https://arxiv.org/pdf/2103.14749.pdf
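        | 
        | A much-simplified version of that kind of check (the paper itself
        | uses a more careful "confident learning" approach; this just flags
        | items where a model confidently disagrees with the given label):
        | 
        |   import numpy as np
        | 
        |   def suspect_label_errors(probs, labels, threshold=0.95):
        |       """probs: (n, k) out-of-sample predicted probabilities;
        |       labels: (n,) integer labels as given in the test set."""
        |       pred = probs.argmax(axis=1)
        |       conf = probs.max(axis=1)
        |       return np.where((pred != labels) & (conf >= threshold))[0]
        | 
        |   # toy example: item 1 is labelled 2 but confidently predicted 0
        |   probs = np.array([[0.10, 0.90, 0.00],
        |                     [0.97, 0.02, 0.01]])
        |   print(suspect_label_errors(probs, np.array([1, 2])))  # [1]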
        
       | diognesofsinope wrote:
       | I come from the econometrics end of things and Paul Romer (Nobel
       | Prize winner) put it well:
       | 
       | 'A fact beats theory every time...And I think economists have
       | focused on theory because it's easier than collecting data.'
       | 
        | When I read this article this is exactly what I thought of --
        | modelers/researchers always go for the low-hanging fruit, which is
        | sitting in their comfy chairs in their ~200-year-old universities
        | developing hyper-complicated models rather than going out and
        | collecting data that would answer their questions.
        
       | dontreact wrote:
        | When someone uses the word "overfitting" I usually take that to
       | mean that a model has begun to enter the phase where further
       | improvement is leading to lower generalization.
       | 
       | In fact, as far as I can tell, we are not overfitting in this
       | sense. When I have seen papers examine whether progress on, let's
        | say, ImageNet, actually generalizes to other categorization
        | datasets, the answer is yes.
       | 
       | What we have been seeing is that the slope of this graph is
       | flattening out a bit. Whereas in the past a 1% improvement on
       | imagenet would have meant a 1% improvement on a similarly
       | collected dataset, nowadays it will be more like .5% (not exact
       | numbers just using numbers to illustrate what I mean by
       | diminishing returns.)
       | 
       | If an institution or a lab can show that progress on their
       | dataset -better- predicts the progress on a bunch of other
       | closely related tasks, then as researchers become convinced of
       | that, they will switch over. Right now there isn't a great
       | alternative because it's not easy to create such a dataset. Scale
       | is critical.
       | 
       | Imagenet really was on the right track as far as collecting
       | images of nearly every semantic concept in the English language.
       | So whatever replaces it will have to be similarly thorough and
       | broad.
       | 
       | In my opinion the biggest weakness of currently existing datasets
       | is that they are typically labeled once per image with no review
       | step. So I think the answer here isn't
       | 
       | "Let's get researchers to use smaller datasets from smaller
       | institutions"
       | 
       | It would be more like
       | 
       | "We have to figure out a way to get a bigger, cleaner version of
       | existing datasets and then prove that progress on those datasets
       | is more meaningful"
       | 
       | The realistic way this plays out is that some institution in the
       | "cartel" releases a better dataset and then lots of small labs
       | try it out and show that progress on that dataset better predicts
       | progress in general.
        
         | visarga wrote:
         | Lately there have been amazing results on pairs of web
         | text+image, no need for labelling. These datasets are hundreds
         | of times larger and cover many more categories of objects.
         | GPT-3 is also trained on raw text. I think ImageNet and its
         | generation have become toy datasets by now.
        
         | mattkrause wrote:
         | I disagree: some "collective overfitting" happens when everyone
         | evaluates on the same dataset (and often, the same test set).
         | 
         | There's a neat set of papers by Recht et al. showing results
         | are slightly overfit to the test partitions of ImageNet and
         | CIFAR-10: rotating examples between the train and test
         | partitions causes systems to perform up to 10-15% worse.
         | 
         | https://arxiv.org/abs/1806.00451
         | https://arxiv.org/abs/1902.10811
         | 
         | There's another neat bit of work involving MNIST. The original
         | dataset (from the mid-90s) had 60,000 _test_ examples, but the
          | distributed version that virtually everyone uses has only
         | 10,000 test examples. Performance on these held-out examples
         | is, unsurprisingly, a bit worse:
         | 
         | https://arxiv.org/pdf/1905.10498.pdf
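          | 
          | The mechanics of the "rotate examples between partitions" check
          | are easy to sketch (illustrative only, on a small sklearn
          | dataset; it won't reproduce the 10-15% drops, which took years
          | of collective tuning against one fixed split to accumulate):
          | 
          |   from sklearn.datasets import load_digits
          |   from sklearn.linear_model import LogisticRegression
          |   from sklearn.model_selection import train_test_split
          | 
          |   X, y = load_digits(return_X_y=True)
          | 
          |   # pool everything, re-split with different seeds, re-evaluate
          |   for seed in (0, 1, 2):
          |       X_tr, X_te, y_tr, y_te = train_test_split(
          |           X, y, test_size=0.25, random_state=seed)
          |       clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
          |       acc = clf.score(X_te, y_te)
          |       print(f"split seed {seed}: test accuracy {acc:.3f}")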
        
           | dontreact wrote:
           | I think it's super important to separate the following two
           | situations, both of which I suppose are fair to call
           | overfitting
           | 
           | Situation A: Models are slightly overfit to some portions of
            | the test set. But the following holds:
            | 
            | IF PerformanceOnBenchmark(Model A) >
            | PerformanceOnBenchmark(Model B)
            | THEN PerformanceOnSimilarDataset(Model A) >
            | PerformanceOnSimilarDataset(Model B)
           | 
           | Therefore progress on the benchmark is predictive of progress
           | in general.
           | 
           | Situation B: The relation does not hold, and therefore
           | progress on the benchmark does not predict general progress.
            | This almost always happens if you train a deep neural network
           | long enough: train performance goes up, but test performance
           | goes down.
           | 
           | If you look at figure 2 of the first paper you sent, you will
           | note that it shows we are in situation A and not situation B.
           | 
           | Situation A overfitting = diminishing returns on improvements
           | on benchmark, but the benchmark is still useful. Situation B
           | overfitting = the benchmark is now useless
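            | 
            | A quick way to state the Situation A condition in code (model
            | names and scores below are made up for illustration):
            | 
            |   from itertools import combinations
            | 
            |   benchmark = {"a": 0.81, "b": 0.78, "c": 0.74}  # benchmark acc
            |   similar   = {"a": 0.70, "b": 0.68, "c": 0.65}  # similar dataset
            | 
            |   # does the benchmark ordering of every pair of models match
            |   # their ordering on the similarly collected dataset?
            |   agree = all(
            |       (benchmark[m1] > benchmark[m2]) == (similar[m1] > similar[m2])
            |       for m1, m2 in combinations(benchmark, 2)
            |   )
            |   print(agree)  # True -> benchmark progress still predicts
            |                 # progress in general (Situation A)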
        
             | sdenton4 wrote:
             | >> "This almost always happen if you train a deep neural
             | network long enough: train performance goes up, but test
             | performance goes down."
             | 
             | This is a problem that is more common for classification
             | problems, I think. Generative and self-supervised models
             | (trained with augmentation) tend to just get better forever
             | (with some asymptote) because memorization isn't a viable
             | strategy.
             | 
             | I personally think image classification is mostly a silly
             | problem to judge new algos on as a result, and leads to all
             | kinds of nonsense as people try to extrapolate meaning from
             | new results.
        
       | gaudat wrote:
        | I'm extremely surprised to see CUHK on the list. I live 10
        | minutes from them and never knew they were a big player in the ML
        | scene. A brief search online didn't turn up anything interesting
        | apart from the CelebA dataset.
        | 
        | Edit: After asking a friend, it seems most of their research is
        | with official Chinese research institutes and SenseTime. It makes
        | sense now.
        
       | adflux wrote:
       | Cartel, really?
        
         | chewbacha wrote:
         | Right? I thought so too, but this is the first definition from
         | the American heritage dictionary:
         | 
         | > A combination of independent business organizations formed to
         | regulate production, pricing, and marketing of goods by the
         | members.
         | 
          | And it does seem to apply ¯\_(ツ)_/¯
        
           | cdot2 wrote:
           | They're not regulating anything. They just make the best
           | datasets and those are the ones that get used
        
             | chaps wrote:
             | Legal regulation, no, but in the sense of having groups of
             | internal people approving the release of data, data cleanup
             | processes, data input, etc etc etc... yes, it's 100%
             | regulated. At least in the cybernetics sense.
        
               | cdot2 wrote:
               | That's true of all organizations that release data.
               | They're regulating their own data. They're not regulating
               | the use of that data
        
               | chaps wrote:
               | Since we're talking about this in a definitional context
               | you can just as easily argue that a drug cartel doesn't
                | regulate the use of its drugs...
        
               | cdot2 wrote:
               | A cartel regulates a market. Drug cartels regulate the
               | buying and selling of drugs. This "cartel" doesn't
               | regulate any market
        
       ___________________________________________________________________
       (page generated 2021-12-06 23:01 UTC)