[HN Gopher] A cartel of influential datasets are dominating mach...
___________________________________________________________________
A cartel of influential datasets are dominating machine learning
research
Author : Hard_Space
Score : 227 points
Date : 2021-12-06 10:46 UTC (12 hours ago)
(HTM) web link (www.unite.ai)
(TXT) w3m dump (www.unite.ai)
| omarhaneef wrote:
| A little confused:
|
| Are these big "dominant" institutions charging for this data? No,
| they spend a lot of resources putting the datasets together and
| give them away for free.
|
| Are they preventing others from giving away data? No, but it
| costs a lot and they bear that cost.
|
| Are they forcing smaller institutions to use their data? No. It's
| just a free resource they offer.
|
| Do they get the grants themselves because they have some kind of
| proprietary access? No, the whole point is that the benchmark is
| open and everyone has access.
|
| So they collect this data, vet it, propose their use for
| benchmarks and give it away free. What is the complaint? The
| problem is not even posed as "well, this data is overfitted in
| papers or we are solving narrower problems". [Edit: they sort of
| do, but leaving this here so the comment below makes sense as a
| follow up.]
|
| No, the complaint is that this handful of institutions gives away
| free data and too many people use it? Can we have more "problems"
| of this nature?
| michaelt wrote:
| _> So they collect this data, vet it, propose their use for
| benchmarks and give it away free. What is the complaint?_
|
| It's well known that neural networks can easily inherit biases
| from their training data. It's also well known that datasets
| generated by western universities are widely used in training
| and evaluating neural networks.
|
| If my training set is full of pictures of Stanford CS
| undergraduates, I could end up with a computational photography
| system that makes everyone look like Stanford CS
| undergraduates, or a historical photo colourisation system that
| makes everyone look like Stanford CS undergraduates, or a self
| driving car pedestrian tracking system that expects 90% of
| pedestrians to look like Stanford CS undergraduates.
|
| And if the people who make the model say "Hey, our model's
| biases aren't our responsibility, we're just representing the
| training data as best we can" and the people who make the
| training data say "Hey, we never claimed it was perfect, you
| can take it or leave it" these problems might fall through the
| cracks.
| 908B64B197 wrote:
| > It's well known that neural networks can easily inherit
| biases from their training data. It's also well known that
| datasets generated by western universities are widely used in
| training and evaluating neural networks. If my training set
| is full of pictures of Stanford CS undergraduates, I could
| end up with a computational photography system that makes
| everyone look like Stanford CS undergraduates, or a
| historical photo colourisation system that makes everyone
| look like Stanford CS undergraduates, or a self driving car
| pedestrian tracking system that expects 90% of pedestrians to
| look like Stanford CS undergraduates.
|
| Why then aren't foreign universities/companies simply...
| building their own datasets?
| mattnewton wrote:
| > And if the people who make the model say "Hey, our model's
| biases aren't our responsibility, we're just representing the
| training data as best we can"
|
| I don't think this excuse is like the others. If the model
| doesn't work well because they used biased data, it is the job
| of the people making the model to find better data (or
| manipulate the training process to overweight some data and
| attempt to counteract the bias).
|
| I think the burden of responsibility has to be on the people
| who make models or put models into products to make sure the
| model is a good fit for the problem it is solving.
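|
| A minimal sketch of the "overweight some data" option, assuming
| each example carries a (hypothetical) group label and using
| PyTorch's weighted sampling; it mitigates sampling bias but
| cannot invent data that was never collected:
|
|     import numpy as np
|     import torch
|     from torch.utils.data import (DataLoader, TensorDataset,
|                                   WeightedRandomSampler)
|
|     # toy data: features, labels, and a group id per example
|     # (e.g. a demographic bucket -- purely illustrative)
|     X = torch.randn(1000, 16)
|     y = torch.randint(0, 2, (1000,))
|     group = np.random.choice([0, 1, 2], size=1000,
|                              p=[0.80, 0.15, 0.05])
|
|     # inverse-frequency weights: rare groups get sampled more often
|     counts = np.bincount(group)
|     weights = torch.as_tensor(1.0 / counts[group], dtype=torch.double)
|
|     sampler = WeightedRandomSampler(weights,
|                                     num_samples=len(weights),
|                                     replacement=True)
|     loader = DataLoader(TensorDataset(X, y), batch_size=32,
|                         sampler=sampler)
|     # each epoch now sees the three groups at roughly equal rates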
| lumost wrote:
| I believe the author is suggesting that large institutions are
| doing this to earn extra citations - the currency of academia.
|
| A large institution can make a dataset for X then browbeat
| other researchers into using X and citing X. Using X also
| likely leads to citations of derivative work by the lead
| institution.
| omarhaneef wrote:
| Is the term "browbeat" fair here? I don't think they're
| making calls and saying "Oh, nice paper, but I notice you
| used this other dataset..." No, they're putting out good
| quality datasets that people want to use. If that earns them
| a citation, good for them. It's the least I can do for helping
| me test my algo.
| lumost wrote:
| It depends; feedback is rarely as polite as "I noticed you
| used this other dataset". The feedback would probably look
| more like:
|
| - "Nice paper, however the results are not relevant to
| current research due to the use of X dataset rather than Y
| or Z datasets score 2/5 do not accept."
|
| - "Nice paper, however the results are of unknown quality
| due to the use of X dataset 3/5 recommend poster track".
|
| In fact I'd generally say that most paper reviews would
| drop the first three words of that feedback. It's not an
| unreasonable assertion that progress is measured on
| standard datasets - but it's also necessary to push back on
| this.
| durovo wrote:
| If a non-standard dataset is being used, I would expect
| there to be a discussion/analysis on what characteristics
| of that dataset made it unusable for this paper.
| Especially if a proposed model is being compared against
| models that were trained on those standard datasets.
|
| If you are establishing new baselines using those same
| models on your non-standard dataset, then one would
| expect you to put in a good amount of effort to finetune
| all the knobs to get a reasonable result. If the authors
| are able to put in that much effort, then that kind of
| feedback is definitely unreasonable.
| YeGoblynQueenne wrote:
| >> If a non-standard dataset is being used, I would
| expect there to be a discussion/analysis on what
| characteristics of that dataset made it unusable for this
| paper.
|
| Unfortunately that just adds more work for the reviewer,
| which is a motive for many reviewers to scrap the paper
| so they don't have to do the extra work.
|
| That sounds mean, so I will quote (yet again) Geoff
| Hinton on things that "make the brain hurt":
|
| _GH: One big challenge the community faces is that if
| you want to get a paper published in machine learning now
| it's got to have a table in it, with all these different
| data sets across the top, and all these different methods
| along the side, and your method has to look like the best
| one. If it doesn't look like that, it's hard to get
| published. I don't think that's encouraging people to
| think about radically new ideas._
|
| _Now if you send in a paper that has a radically new
| idea, there's no chance in hell it will get accepted,
| because it's going to get some junior reviewer who
| doesn't understand it. Or it's going to get a senior
| reviewer who's trying to review too many papers and
| doesn't understand it first time round and assumes it
| must be nonsense. Anything that makes the brain hurt is
| not going to get accepted. And I think that's really
| bad._
|
| https://www.wired.com/story/googles-ai-guru-computers-
| think-...
|
| Basically a new dataset is like a new idea: it makes the
| brain hurt, for the overburdened experienced researcher
| or inexperienced younger researcher alike. Testing a new
| approach on a new dataset? That makes brain go boom.
|
| Which is a funny state of affairs. Not so long ago it
| used to be that one sure-fire way to make a significant
| contribution that would give your paper a leg up over the
| competition was to create a new dataset. I was advised as
| much at the start of my PhD (four-ish years ago). Seems
| like this has already changed.
| durovo wrote:
| You might be right here. My comment was more of my
| expectation as a reader on what should be present in such
| a paper.
| blt wrote:
| Absolutely. Failure to report results on a popular
| benchmark suggests to some reviewers that you have
| something to hide - even though they might be
| computationally expensive or tangential to the main point
| of the work.
| Levitz wrote:
| >The problem is not even posed as "well, this data is
| overfitted in papers or we are solving narrower problems".
|
| I mean it does get into that:
|
| "They additionally note that blind adherence to this small
| number of 'gold' datasets encourages researchers to achieve
| results that are overfitted (i.e. that are dataset-specific and
| not likely to perform anywhere near as well on real-world data,
| on new academic or original datasets, or even necessarily on
| different datasets in the 'gold standard')."
| omarhaneef wrote:
| True, I guess my complaint is that they don't get into the
| mechanisms.
|
| The sibling comment on the benchmark lottery paper lays this out.
| But I should modify.
| [deleted]
| sabhiram wrote:
| If there were a dataset that solved a problem sufficiently
| (perception, self-driving, whatever), there would be no more need
| for any study of that field (other than for marginal
| improvements).
|
| Once solved, these are no longer research areas. Getting X score
| on a Y validation set means you get your s/w driver's license.
| Derbasti wrote:
| This is what my dissertation was about, in a different field:
| Everyone using the same datasets, even though they are severely
| flawed.
|
| As it turned out, differences between datasets proved
| significantly larger (by a big margin) than differences between
| algorithms. And the most popular datasets in fact included biases
| and eccentricities that were bound to cause such problems.
|
| https://bastibe.github.io/Dissertation-Website/ (figures
| 12.9-12.11, if you're interested)
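|
| The comparison itself is easy to sketch: lay the scores out as
| an algorithms-by-datasets matrix and compare the spread along
| each axis (synthetic numbers below, not the dissertation's
| data):
|
|     import numpy as np
|
|     # fake score matrix: 8 algorithms x 12 datasets
|     rng = np.random.default_rng(0)
|     scores = rng.normal(0.70, 0.02, size=(8, 12))    # noise
|     scores += rng.normal(0.0, 0.10, size=(1, 12))    # dataset effect
|     scores += rng.normal(0.0, 0.01, size=(8, 1))     # algorithm effect
|
|     dataset_spread = scores.mean(axis=0).std()    # per-dataset means
|     algorithm_spread = scores.mean(axis=1).std()  # per-algorithm means
|     print(f"spread across datasets:   {dataset_spread:.3f}")
|     print(f"spread across algorithms: {algorithm_spread:.3f}")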
| fock wrote:
| Thank you! I also found the same thing during my PhD, and it
| took a basic pay-to-publish paper and 2 years to get my
| supervisor to agree that this was not really going anywhere
| publication-wise and to switch to a field where I am more
| comfortable with producing publications. Essentially my field
| has 1 big dataset, and then domain experts "improve" models by
| creating their own data (200 samples) and then claim a "novel"
| method (which is not transferable at all).
|
| The pay-to-publish paper was sent back by 2 journals with
| reviews amounting to exactly "our method is better, just use the
| right (our) thing" (all of which I refuted for the professor by
| doing just that, but the editor wouldn't hear any more of it...).
| And then there are papers in these journals predicting features
| through a complex preprocessing pipeline. Academia and big
| companies are just idiotic.
| Derbasti wrote:
| Funny that you would mention trying to publish such findings
| in journals. I tried... more than two times, as well.
| Rejected with the most spurious of claims, despite my
| presenting evidence disproving them. Frankly, these journal
| submissions were _the_ most frustrating and trying
| experiences in my professional life.
|
| Oh well. I'm glad I got my degree, and could leave academia
| relatively unharmed.
|
| Now I work in AI (engineering!), where most science is
| high-school-level "hey, I tried stuff, and things happened.
| Dunno why, though". It's just ridiculous.
| zibzab wrote:
| Somewhat related to this: What methods do people normally use to
| measure the quality of a dataset?
|
| For example, if there are 50 datasets of historical weather data
| how can I determine which one is garbage?
| dunefox wrote:
| I would say it's garbage if there's a paper like this:
| https://arxiv.org/abs/1902.01007
| visarga wrote:
| That's like hard negative mining. You can also train a weak
| model and filter out the examples it manages to predict
| correctly to come up with a harder dataset.
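|
| A rough sketch of that filtering step, with scikit-learn
| stand-ins for the weak model and the candidate dataset:
|
|     import numpy as np
|     from sklearn.datasets import make_classification
|     from sklearn.linear_model import LogisticRegression
|     from sklearn.model_selection import cross_val_predict
|
|     # stand-in data; in practice X, y would be the raw dataset
|     X, y = make_classification(n_samples=5000, n_features=20,
|                                random_state=0)
|
|     # out-of-fold predictions from a deliberately weak model
|     preds = cross_val_predict(LogisticRegression(max_iter=1000),
|                               X, y, cv=5)
|
|     # keep only what the weak model gets wrong -> a harder subset
|     hard = preds != y
|     X_hard, y_hard = X[hard], y[hard]
|     print(f"kept {hard.mean():.1%} of examples as the hard subset")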
| [deleted]
| YeGoblynQueenne wrote:
| >> According to the paper, Computer Vision research is notably
| more affected by the syndrome it outlines than other sectors,
| with the authors noting that Natural Language Processing (NLP)
| research is far less affected. The authors suggest that this
| could be because NLP communities are 'more coherent' and larger
| in size, and because NLP datasets are more accessible and easier
| to curate, as well as being smaller and less resource-intensive
| in terms of data-gathering.
|
| I'll have to read the paper to see what exactly it says on this
| but my knowledge of NLP benchmark datasets is exactly the
| opposite: the majority are simply no use for measuring the
| capabilities they're supposed to be measuring. For example,
| natural language _understanding_ datasets are typically created
| as multiple-choice questionnaires, so that, a) there is already a
| baseline accuracy that a system can "hit" just by chance (but
| which is almost never noted, or compared against, in papers) and
| b) a system that's good at classification can beat the benchmarks
| black and blue without doing any "understanding". And this is,
| indeed, what's been going on for quite a while now with large
| language models that take all the trophies and are still dumb as
| bricks.
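|
| For concreteness, these are the trivial baselines a 4-way
| multiple-choice benchmark already implies, and which papers
| rarely report (a toy sketch with made-up answer keys):
|
|     import numpy as np
|
|     labels = np.random.randint(0, 4, size=10_000)
|
|     chance = 1 / 4                       # guess uniformly at random
|     majority = np.bincount(labels).max() / len(labels)
|     print(f"chance: {chance:.1%}, majority class: {majority:.1%}")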
|
| To make matters worse, NLP also doesn't have any good _metrics_
| of performance. Stuff like the BLEU scores are just laughably
| inadequate. NLP is all bad metrics over bad benchmarks. _And_ NLP
| results are much harder to just "eyeball" than machine vision
| results (and the models are much harder to interpret than machine
| vision models where you can at least visualise the activations
| and see... something). I think NLP is much, much worse than
| machine vision.
| xiaolingxiao wrote:
| Your suspicion is correct. I worked on such a dataset paper and
| tried to "beat" other methods on well-accepted benchmarks with
| dubious accuracy scores. One fundamental issue is that outside
| of POS tagging, there isn't a notion of empirical truth to
| measure against, only a small sample of what a "normal person
| would think." This stands in contrast to computer vision,
| where, in a task such as monocular depth perception from a
| single frame, you can always measure against a Lidar-scanned
| depth map. Systems can still overfit on the benchmark, and
| they do, but at least the baseline truth itself is not in
| dispute. But questions such as whether this is the "appropriate"
| response to a query are too open to interpretation.
| hamasho wrote:
| I think it's good that well-funded institutions publish well-
| crafted datasets for everyone else. It helps small teams develop
| their models.
|
| The problem here, though, is that most publishers don't accept
| papers if the new proposal isn't backed by benchmarks on the
| well-known datasets. Even if it is a competitive approach for a
| specific field, they reject the paper anyway if it doesn't
| perform well on those datasets.
| throwawayay02 wrote:
| Do they really reject them though? In my experience publishers
| will publish anything; if anything, they should have higher
| standards, not lower.
| domenicrosati wrote:
| For brand name journals and conferences, yes they do. My
| experience with those is a very high rejection rate.
|
| There is a good reason to reject if you are not using a
| standard dataset. How can you compare the results of two
| approaches to, say, natural language inference (whether one
| sentence entails another) without results being tested on the
| same dataset?
|
| I think one thing overlooked in the conversation is that many
| papers start with a standard baseline and then use another
| dataset to establish additional results.
|
| In my experience in NLP, journals and conferences also tend
| to establish datasets of their own when they issue a call for
| submissions. Often these are called the "shared task" track.
| ACL has operated this way for decades.
| irjustin wrote:
| It's a pattern matching problem. It takes a lot of work, and
| there are too many unknowns, when dealing with "new benchmark
| sets" or ... just any other set.
| F6F6FA wrote:
| The word "cartel" is too negative, especially when discussing
| political and social factors. The narrative seems written, before
| their findings. Or at least, it could well have.
|
| For a computer science analogy: it is like a paper finding that
| most successful programming languages are created at prestigious
| institutes. An obvious - not a bad - finding. It's not as if you
| could give the motivation, skills, expertise, resources, and
| time to a small new institute and expect it to come up with a
| new language that the community will adopt.
|
| Yes, if you write and publish a good data set, and it gets
| adopted by the community, then you gain lots of citations. This
| reward is known, and therefore some researchers expend the effort
| of gathering and curating all this data.
|
| It is not a "vehicle for inequality in science". Benchmarks in ML
| are a way to create an equal playing field for all, and allows
| one to compare results. Picking a non-standard new benchmark to
| evaluate your algorithm is bad practice. And benchmarks are the
| true meritocracy. Beat the benchmark, and you too can publish. No
| matter the PR or extra resources from big labs. It is test
| evaluation that counts, and this makes it fair. Other fields may
| have authorities writing papers without even an evaluation.
| That's not a good position for a field to be in.
|
| > The prima facie scientific validity granted by SOTA
| benchmarking is generically confounded with the social
| credibility researchers obtain by showing they can compete on a
| widely recognized dataset
|
| Here, the authors pretend the social credibility of researchers
| has any sway. There is no social credibility for a Master's
| student in Bangladesh, but when they show they can compete, they
| can join and publish. Wonderful!
|
| Where the authors use the long history of train-test splits to
| argue that the cons have outweighed the benefits, they should
| reason more and provide more data to actually show this and bring
| the field along. Ironically, people take more note of this very
| paper because of the institutional affiliation of the authors. I
| do too. If they had a benchmark, I would have looked at that
| first.
|
| > Given the observed high concentration of research on a small
| number of benchmark datasets, we believe diversifying forms of
| evaluation is especially important to avoid overfitting to
| existing datasets and misrepresenting progress in the field.
|
| I believe these authors find diversity important. But on
| overfitting, they should look at actual (meta-)studies and data,
| which seem to conflict with their claim. For instance:
|
| > A Meta-Analysis of Overfitting in Machine Learning (2019)
|
| > We conduct the first large meta-analysis of overfitting due to
| test set reuse in the machine learning community. Our analysis is
| based on over one hundred machine learning competitions hosted on
| the Kaggle platform over the course of several years. In each
| competition, numerous practitioners repeatedly evaluated their
| progress against a holdout set that forms the basis of a public
| ranking available throughout the competition. Performance on a
| separate test set used only once determined the final ranking. By
| systematically comparing the public ranking with the final
| ranking, we assess how much participants adapted to the holdout
| set over the course of a competition. Our study shows, somewhat
| surprisingly, little evidence of substantial overfitting. These
| findings speak to the robustness of the holdout method across
| different data domains, loss functions, model classes, and human
| analysts.
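|
| The core check in that meta-analysis is simple to sketch:
| compare each team's ranking on the reused public holdout with
| its ranking on the once-used final test set (toy numbers here,
| not Kaggle data):
|
|     import numpy as np
|     from scipy.stats import spearmanr
|
|     # 200 hypothetical teams with noisy scores on two holdouts
|     rng = np.random.default_rng(0)
|     skill = rng.normal(size=200)
|     public = skill + rng.normal(scale=0.1, size=200)  # reused set
|     final = skill + rng.normal(scale=0.1, size=200)   # used once
|
|     rho, _ = spearmanr(public, final)
|     print(f"rank correlation, public vs final: {rho:.2f}")
|
| A high correlation, which is what the study reports, means the
| adaptive reuse of the holdout did not translate into substantial
| overfitting.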
| carschno wrote:
| Too bad they don't cite the paper "The Benchmark Lottery" (M.
| Dehghani et al, 2021) (https://arxiv.org/abs/2107.07002)
|
| > The world of empirical machine learning (ML) strongly relies on
| benchmarks in order to determine the relative effectiveness of
| different algorithms and methods. This paper proposes the notion
| of "a benchmark lottery" that describes the overall fragility of
| the ML benchmarking process. The benchmark lottery postulates
| that many factors, other than fundamental algorithmic
| superiority, may lead to a method being perceived as superior. On
| multiple benchmark setups that are prevalent in the ML community,
| we show that the relative performance of algorithms may be
| altered significantly simply by choosing different benchmark
| tasks, highlighting the fragility of the current paradigms and
| potential fallacious interpretation derived from benchmarking ML
| methods. Given that every benchmark makes a statement about what
| it perceives to be important, we argue that this might lead to
| biased progress in the community. We discuss the implications of
| the observed phenomena and provide recommendations on mitigating
| them using multiple machine learning domains and communities as
| use cases, including natural language processing, computer
| vision, information retrieval, recommender systems, and
| reinforcement learning.
|
| Edit: By "they", I was actually referring to the linked article.
| Strangely, even the paper the article is about does not cite "The
| Benchmark Lottery" at all.
| stayfrosty420 wrote:
| Seems like a distinct issue to me, though there are obvious
| parallels and similar outcomes.
| koheripbal wrote:
| Because AIs are relatively narrowly focused, they suffer
| greatly from limited data sets. More often than not, an AI
| will simply memorize test data rather than "learning".
|
| This impacts the benchmarking process, so I think it's
| relevant.
| mellavora wrote:
| Hey, you're giving away my secret for passing AWS
| certification exams!
| carschno wrote:
| Agree, I was not saying it was redundant. But I would have
| expected a reference, as that paper was the first related
| item that came to my mind.
| sea-shunned wrote:
| It's also a shame that this paper in turn doesn't cite "Testing
| Heuristics: We Have It All Wrong" (J. N. Hooker, 1995) (http://
| citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71....),
| which discusses these issues in much the same way. It's good to
| see in "The Benchmark Lottery", however, that they look more
| into specific tasks and their algorithmic rankings and provide
| some sound recommendations.
|
| One thing that I'd add (somewhat selfishly as it relates to my
| PhD work), is the idea of generating datasets that are
| deliberately challenging for different algorithms. Scale this
| across a test suite of algorithms, and their relative strengths
| and weaknesses become clearer. The caveat here is that it
| requires having a set of measures that quantify different types
| of problem difficulty, which depending on the task/domain can
| range from well-defined to near-impossible.
| gopher_space wrote:
| I've been looking at parsing paragraph structure and have
| started thinking about a conceptual mechanical turk/e e
| cummings line in the sand where it's just going to be easier
| to pay some kid with a cell phone to read words for you. The
| working implementations I've seen are heavily tied to domain
| and need to nail down language, which isn't really a thing.
|
| Quantification is fascinating, it seems to be something I
| take for granted until I actually want to make decisions.
| It's like I'm constantly trying to forget that analog and
| digital are two totally separate concepts. I wouldn't really
| recommend reading Castaneda to anyone but he describes people
| living comfortably with mutually exclusive ideas in their
| head walled off by context, and I'd like that sort of
| understanding.
| Der_Einzige wrote:
| Well, I recently released an NLP dataset of almost 200K documents
| and I got 4 whole citations in a year. Wish I could find a way to
| join this cartel and get someone to use it or even care.
|
| https://paperswithcode.com/dataset/debatesum
|
| https://huggingface.co/datasets/Hellisotherpeople/DebateSum
|
| https://scholar.google.com/citations?user=uHozRV4AAAAJ&hl=en
| theknocker wrote:
| So the AI has already taken over its own research.
| sqrt17 wrote:
| It takes substantial effort to build a good dataset,
| proportionally more if it gets bigger, and people like big
| datasets because you can train more powerful models from them. So
| I am not surprised that people tend to gravitate towards datasets
| made by well-funded institutions.
|
| The alternative is either a small dataset that people heavily
| overfit (e.g. the MUC6 corpus that was heavily used for
| coreference at some point, where people cared more about getting
| high numbers than useful results) or things like the Universal
| Dependencies corpus, which are provided by a large consortium of
| smaller institutions.
| lifeisstillgood wrote:
| So, like humans, training an AI takes lots of curated data
| (education, journalism, political savvy and avoidance of bias or
| corruption). It took a very long time to get that corpus ready
| and available for humanity (say 10,000 years).
|
| Now we have exploded that corpus to include everything anyone
| says online, and we are worried that the lack of curation means
| we cannot be sure what the models will come back with.
|
| It's rather like sending our kids out to find an education
| themselves, and finding three come back with nothing, two spent
| years learning from the cesspit of extremism, two are drug
| addicts hitting "more" and one stumbled into a library.
|
| Just a thought, but I don't think journalism is really about
| writing newspaper articles. It really is about _curating_ the
| whole wide world and coming back with "this is what you need to
| know". Journalism is the curation AI needs...
| visarga wrote:
| Kids (and GPT) say the darndest things...
| oakfr wrote:
| This article is completely disconnected from what is happening in
| the machine learning community.
|
| Granted, datasets have grown larger and larger over time. This
| concentration is actually a healthy sign of a community maturing
| and converging towards common benchmarks.
|
| These datasets are open to all for research and have fueled
| considerable progress both in academia and industry.
|
| The authors would be well-advised to look at what is happening in
| the community. Excerpts from the program of NeurIPS happening
| literally this week:
|
| Panels:
|
| - The Consequences of Massive Scaling in Machine Learning
| - The Role of Benchmarks in the Scientific Progress of Machine
| Learning
| - How Copyright Shapes Your Datasets and What To Do About It
| - How Should a Machine Learning Researcher Think About AI
| Ethics?
|
| All run by top-notch people coming from both academia and
| industry, and from a variety of places in the world.
|
| I am not saying that everything is perfect, but this article
| paints a much darker picture than needed.
| Terry_Roll wrote:
| Just another example of academic conformity.
| gibsonf1 wrote:
| I think they are highly overestimating how much science advanced
| pattern matching will provide. Certainly no conceptual
| understanding will ever come from that.
| amcoastal wrote:
| Advanced pattern matching is called science; it is how humans
| have made progress, by taking data and writing down models for
| it. Now computers do it. End-to-end algorithms are "black
| boxes", but machine learning in general is much broader than
| that and is providing understanding in many fields. Sorry you
| only see/know the "draw a box around a pedestrian" stuff, but
| try not to judge entire fields based on limited exposure.
| DannyBee wrote:
| I think you are highly underestimating how much science and
| math comes mainly from advanced pattern matching.
|
| Most stuff is proven by using advanced pattern matching as
| "intuition", and then breaking problems down into things we can
| pattern match and prove.
|
| I'm not sure what conceptual understanding you think _isn't_
| pattern matching. They are just rules about patterns and
| interactions of patterns. These rules were developed through
| pattern matching.
| dexter89_kp3 wrote:
| With increasing evidence that self-supervision works for multiple
| tasks and architectures, and sim2real/simulated data being used
| in industry, I do not see this as an important concern.
| andreyk wrote:
| Ugh the title of the article.
|
| Here's the paper that this article is recapping - "Reduced,
| Reused and Recycled: The Life of a Dataset in Machine Learning
| Research", https://openreview.net/forum?id=zNQBIBKJRkd
|
| Abstract: "Benchmark datasets play a central role in the
| organization of machine learning research. They coordinate
| researchers around shared research problems and serve as a
| measure of progress towards shared goals. Despite the
| foundational role of benchmarking practices in this field,
| relatively little attention has been paid to the dynamics of
| benchmark dataset use and reuse, within or across machine
| learning subcommunities. In this paper, we dig into these
| dynamics. We study how dataset usage patterns differ across
| machine learning subcommunities and across time from 2015-2020.
| We find increasing concentration on fewer and fewer datasets
| within task communities, significant adoption of datasets from
| other tasks, and concentration across the field on datasets that
| have been introduced by researchers situated within a small
| number of elite institutions. Our results have implications for
| scientific evaluation, AI ethics, and equity/access within the
| field."
|
| The reviews seem quite positive, if short. On a skim it looks
| very solid, offering an empirical look at the dynamics of
| benchmark usage that IMO seems unprecedented, so I'm not
| surprised it got positive reviews and was accepted.
| oakfr wrote:
| FWIW, this paper is being presented at NeurIPS this week.
| whimsicalism wrote:
| > Among their findings - based on core data from the Facebook-led
| community project Papers With Code (PWC)
|
| Oh boy. PWC is not even close to a representative sample of what
| datasets are being used in papers. It's also often out of date.
| ansgri wrote:
| When you cannot analyze every paper ever published you need to
| have some relevant criterion for inclusion -- and PWC is
| somewhat of a GitHub of AI science (at least in its early days)
| -- there may be multiple collections, but PWC is by far the
| most accessible and thus reasonably captures current and
| emerging trends in some dataset-heavy research fields.
| ckastner wrote:
| Also interesting in this context: this recent paper [1] analyzed
| some of the most popular datasets and found numerous label errors
| in the _test sets_. For example, they estimated at least 6% of
| the samples in the ImageNet validation set were misclassified,
| with surprising consequences when comparing model performances
| using corrected sets.
|
| [1] _Pervasive Label Errors in Test Sets Destabilize Machine
| Learning Benchmarks_ , https://arxiv.org/pdf/2103.14749.pdf
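|
| The paper uses confident learning for this; a much cruder
| sketch of the same intuition (flag test items where an
| out-of-sample model confidently prefers a different class, then
| send them for human review) would look like:
|
|     import numpy as np
|
|     # pred_probs would come from out-of-sample model predictions;
|     # here they are random, just to show the bookkeeping
|     rng = np.random.default_rng(0)
|     n, k = 10_000, 10
|     given = rng.integers(0, k, size=n)
|     pred_probs = rng.dirichlet(np.ones(k), size=n)
|
|     pred = pred_probs.argmax(axis=1)
|     conf = pred_probs.max(axis=1)
|
|     # candidate label errors: confident disagreement with the label
|     suspects = np.where((pred != given) & (conf > 0.95))[0]
|     print(f"{len(suspects)} samples flagged for human review")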
| diognesofsinope wrote:
| I come from the econometrics end of things and Paul Romer (Nobel
| Prize winner) put it well:
|
| 'A fact beats theory every time...And I think economists have
| focused on theory because it's easier than collecting data.'
|
| When I read this article this is exactly what I thought of --
| modelers/researchers always focus on the low-hanging fruit, which
| is sitting in their comfy chair in a ~200-year-old university
| developing hyper-complicated models rather than going out and
| collecting data that would answer their questions.
| dontreact wrote:
| When someone uses the word "overfitting" I usually take that to
| mean that a model has begun to enter the phase where further
| improvement leads to lower generalization.
|
| In fact, as far as I can tell, we are not overfitting in this
| sense. When I have seen papers examine whether progress on, let's
| say, ImageNet actually generalizes to other categorization
| datasets, the answer is yes.
|
| What we have been seeing is that the slope of this graph is
| flattening out a bit. Whereas in the past a 1% improvement on
| ImageNet would have meant a 1% improvement on a similarly
| collected dataset, nowadays it will be more like 0.5% (not exact
| numbers, just using numbers to illustrate what I mean by
| diminishing returns).
|
| If an institution or a lab can show that progress on their
| dataset -better- predicts the progress on a bunch of other
| closely related tasks, then as researchers become convinced of
| that, they will switch over. Right now there isn't a great
| alternative because it's not easy to create such a dataset. Scale
| is critical.
|
| ImageNet really was on the right track as far as collecting
| images of nearly every semantic concept in the English language.
| So whatever replaces it will have to be similarly thorough and
| broad.
|
| In my opinion the biggest weakness of currently existing datasets
| is that they are typically labeled once per image with no review
| step. So I think the answer here isn't
|
| "Let's get researchers to use smaller datasets from smaller
| institutions"
|
| It would be more like
|
| "We have to figure out a way to get a bigger, cleaner version of
| existing datasets and then prove that progress on those datasets
| is more meaningful"
|
| The realistic way this plays out is that some institution in the
| "cartel" releases a better dataset and then lots of small labs
| try it out and show that progress on that dataset better predicts
| progress in general.
| visarga wrote:
| Lately there have been amazing results on pairs of web
| text+image, no need for labelling. These datasets are hundreds
| of times larger and cover many more categories of objects.
| GPT-3 is also trained on raw text. I think ImageNet and its
| generation have become toy datasets by now.
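|
| A toy sketch of the contrastive objective behind that kind of
| web-scale pairing (CLIP-style; the encoder outputs here are
| stand-in random vectors):
|
|     import torch
|     import torch.nn.functional as F
|
|     # embeddings for a batch of (image, caption) pairs
|     batch, dim = 32, 64
|     img = F.normalize(torch.randn(batch, dim), dim=-1)
|     txt = F.normalize(torch.randn(batch, dim), dim=-1)
|
|     logits = img @ txt.t() / 0.07   # similarities / temperature
|     target = torch.arange(batch)    # i-th caption matches i-th image
|     loss = (F.cross_entropy(logits, target) +
|             F.cross_entropy(logits.t(), target)) / 2
|     # supervision comes entirely from which caption was scraped
|     # alongside which image -- no human labels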
| mattkrause wrote:
| I disagree: some "collective overfitting" happens when everyone
| evaluates on the same dataset (and often, the same test set).
|
| There's a neat set of papers by Recht et al. showing results
| are slightly overfit to the test partitions of ImageNet and
| CIFAR-10: rotating examples between the train and test
| partitions causes systems to perform up to 10-15% worse.
|
| https://arxiv.org/abs/1806.00451
| https://arxiv.org/abs/1902.10811
|
| There's another neat bit of work involving MNIST. The original
| dataset (from the mid-90s) had 60,000 _test_ examples, but the
| distributed versions that virtually everyone uses has only
| 10,000 test examples. Performance on these held-out examples
| is, unsurprisingly, a bit worse:
|
| https://arxiv.org/pdf/1905.10498.pdf
| dontreact wrote:
| I think it's super important to separate the following two
| situations, both of which I suppose are fair to call
| overfitting
|
| Situation A: Models are slightly overfit to some portions of
| the test set, but the following holds:
|
| IF PerformanceOnBenchmark(Model A) >
| PerformanceOnBenchmark(Model B)
| THEN PerformanceOnSimilarDataset(Model A) >
| PerformanceOnSimilarDataset(Model B)
|
| Therefore progress on the benchmark is predictive of progress
| in general.
|
| Situation B: The relation does not hold, and therefore
| progress on the benchmark does not predict general progress.
| This almost always happens if you train a deep neural network
| long enough: train performance goes up, but test performance
| goes down.
|
| If you look at figure 2 of the first paper you sent, you will
| note that it shows we are in situation A and not situation B.
|
| Situation A overfitting = diminishing returns on improvements
| on benchmark, but the benchmark is still useful. Situation B
| overfitting = the benchmark is now useless
| sdenton4 wrote:
| >> "This almost always happen if you train a deep neural
| network long enough: train performance goes up, but test
| performance goes down."
|
| This is a problem that is more common for classification
| problems, I think. Generative and self-supervised models
| (trained with augmentation) tend to just get better forever
| (with some asymptote) because memorization isn't a viable
| strategy.
|
| I personally think image classification is mostly a silly
| problem to judge new algos on as a result, and leads to all
| kinds of nonsense as people try to extrapolate meaning from
| new results.
| gaudat wrote:
| I'm extremely surprised to see CUHK on the list. I live 10
| minutes from them and never knew they were a big player in the ML
| scene. A brief search online didn't turn up anything interesting
| apart from the CelebA dataset.
|
| Edit: After asking a friend, it seems most of their research is
| done with Chinese official research institutes and SenseTime. It
| makes sense now.
| adflux wrote:
| Cartel, really?
| chewbacha wrote:
| Right? I thought so too, but this is the first definition from
| the American Heritage Dictionary:
|
| > A combination of independent business organizations formed to
| regulate production, pricing, and marketing of goods by the
| members.
|
| And it does seem to apply ¯\_(ツ)_/¯
| cdot2 wrote:
| They're not regulating anything. They just make the best
| datasets and those are the ones that get used
| chaps wrote:
| Legal regulation, no, but in the sense of having groups of
| internal people approving the release of data, data cleanup
| processes, data input, etc etc etc... yes, it's 100%
| regulated. At least in the cybernetics sense.
| cdot2 wrote:
| That's true of all organizations that release data.
| They're regulating their own data. They're not regulating
| the use of that data
| chaps wrote:
| Since we're talking about this in a definitional context
| you can just as easily argue that a drug cartel doesn't
| regulate the use of their drugs..
| cdot2 wrote:
| A cartel regulates a market. Drug cartels regulate the
| buying and selling of drugs. This "cartel" doesn't
| regulate any market
___________________________________________________________________
(page generated 2021-12-06 23:01 UTC)