[HN Gopher] Error-riddled data sets are warping our sense of how...
___________________________________________________________________
Error-riddled data sets are warping our sense of how good AI is
Author : djoldman
Score : 110 points
Date : 2021-04-02 13:49 UTC (2 days ago)
(HTM) web link (www.technologyreview.com)
(TXT) w3m dump (www.technologyreview.com)
| Veedrac wrote:
| This article doesn't make any major factual errors, but it
| implies that this is some sort of surprise, and that it's a
| major impediment, neither of which holds up. While training on
| noisy datasets is an interesting problem for many practical use-
| cases, a few percent label errors in these classic basic datasets
| (MNIST, CIFAR, ImageNet) really doesn't matter that much. The
| first two are basically toys to check that your system works at
| all at this point, and even on ImageNet the best models are now
| more accurate than the original labels.
|
| However, per [1], progress on current ImageNet still correlates
| with true accuracy. This is in large part because we move to
| harder tests when the easy ones stop working. In part it's also
| a good thing, because training with label noise forces better-
| generalizing solutions. The current SOTA is Meta Pseudo
| Labels[2], which is a particularly clever trick that never even
| directly exposes the final model to the training data.
|
| [1] Are we done with ImageNet? --
| https://arxiv.org/abs/2006.07159
|
| [2] Meta Pseudo Labels -- https://arxiv.org/abs/2003.10580
| currio wrote:
| I think the other result from the article (and the paper it
| links to) is that with more accurate data, simpler models
| perform better than complicated models.
|
| This is a good observation, because as the saying goes, "garbage
| in, garbage out", or "a model is only as good as its data". It
| is better to focus on better data than to spend time improving
| models or overfitting complicated models to inaccurate data.
| Guest42 wrote:
| Hadn't thought of it, but in my experience the "newer" models
| require a lot more data. This would then become an impediment,
| because there's a certain likelihood that quality goes down
| with the increased quantity, i.e. data measured differently or
| under different conditions.
| iudqnolq wrote:
| The article alleges that the error is non-random. There are
| some specific sorts of errors the humans on mechanical turk
| made, and the SOTA models get a significant portion of their
| advantage over earlier models based only on being better at
| guessing the incorrect labels. To a layperson, this sounds like
| a different issue to the one you refute. Is this not a big
| deal?
| Veedrac wrote:
| Label noise will almost always give you a weaker model, and
| sometimes it matters whether those are merely random or due
| to some general confusion (eg. people mistaking crabs for
| lobsters). But these datasets are best thought of as research
| tools, not generally directly useful for applied models.
|
| Whereas there are major practical issues with even small or
| subtle errors in, eg., medical or legal datasets for real-
| world models, a pure research dataset is generally fine as
| long as a higher score correlates with a smarter model. If
| MNIST, a nowadays-trivial handwritten digit recognition
| dataset, had all its '1's labels swapped with '3's labels, it
| would pretty much fill its role just as well.
| SpicyLemonZest wrote:
| It's good to hear that experts have strategies to handle the
| problem, but I think this will definitely be a surprise to most
| of the article's readers. I'm a relatively AI-interested
| layman, and I had assumed that these kinds of datasets had more
| like 99% accuracy.
| Veedrac wrote:
| Heh, I guess this is a difference in perspectives. ML has
| some really shoddy datasets. Like, really bad, worse than you'd
| think, I mean.
|
| > The macro-averaged results show that the ratio of correct
| samples ("C") ranges from 24% to 87%, with a large variance
| across the five audited datasets. Particularly severe
| problems were found in CCAligned and WikiMatrix, with 44 of
| the 65 languages that we audited for CCAligned containing
| under 50% correct sentences, and 19 of the 20 in WikiMatrix.
| In total, 15 of the 205 language specific samples (7.3%)
| contained not a single correct sentence.
|
| Quality at a Glance: An Audit of Web-Crawled Multilingual
| Datasets -- https://arxiv.org/abs/2103.12028
|
| In the face of stuff like that, a 90+% correct dataset
| doesn't seem like a big deal.
| bramblerose wrote:
| https://labelerrors.com/ has an overview of the label errors
| uncovered in this research.
| IshKebab wrote:
| Most of the errors seem to be either:
|
| * The label is for something in the image, but there are
| multiple things in the image.
|
| * Getting animal species wrong.
| Jon_Lowtek wrote:
| That error correction has errors. Their consensus claims the
| label "chain link fence" is wrong on an image that includes a
| chain link fence, albeit as a lesser element of the image's
| composition.
|
| https://labelerrors.com/static/imagenet/val/n03000134/ILSVRC...
|
| It is also interesting that their model guessed "container
| ship" with the dominant element of the picture being a harbor
| crane. I think there might be a lot of images in the ImageNet
| data set that include both a crane and a ship as strong
| elements, but only have the ship label.
|
| ImageNet is a single label set, not a multi label set and that
| is a known problem. This paper talks more about it:
| https://github.com/naver-ai/relabel_imagenet
|
| With large data sets containing millions of images, the only way
| forward seems to be to have AI keep refining the data, escalating
| things that multiple automatons disagree on to be reviewed by
| more classifiers, including humans.
| BlueTemplar wrote:
| So, do they have a higher or lower % of errors than the sets
| that they are studying? :p
| macspoofing wrote:
| Wasn't the entire point of machine learning to sift through noisy
| data and derive useful models or information? Why is this a
| surprise?
| beforeolives wrote:
| You need correct labels to train these models. The point of the
| article is that some of the most popular benchmark datasets
| have a significant proportion of incorrect labels.
| sdenton4 wrote:
| Actually, it's ok to have lots of incorrect training labels;
| the effect is usually just to shift the output probabilities
| downward.
|
| Where you really want/need accurate labels is the eval set,
| which the article addresses.
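|
| For uniform ("symmetric") label noise this is easy to see with a
| toy calculation (made-up class probabilities, nothing from the
| paper): the noisy posterior is a rescaled version of the clean
| one, so the top prediction survives, just with lower confidence.
|
|     import numpy as np
|
|     def noisy_posterior(clean_probs, eps):
|         # Label distribution a well-calibrated model would learn if
|         # each label were flipped with probability eps, uniformly
|         # to one of the other classes.
|         k = len(clean_probs)
|         return (1 - eps) * clean_probs + eps * (1 - clean_probs) / (k - 1)
|
|     clean = np.array([0.7, 0.2, 0.1])  # hypothetical clean posterior
|     noisy = noisy_posterior(clean, eps=0.3)
|     print(noisy)                             # top class: 0.70 -> 0.535
|     print(clean.argmax() == noisy.argmax())  # True: same prediction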
| cgn wrote:
| I am an author on the paper, and just saw this thread.
| Thanks for all of your comments. Note this work focuses on
| errors in test sets, not training labels. Test sets are the
| foundation of benchmarking for machine learning. We study
| test sets (versus training sets) because training sets are
| already studied in depth in confident learning
| (https://arxiv.org/pdf/1911.00068.pdf), and because before
| this paper no one had studied (at scale, for lots of
| datasets) whether these test sets had errors in the first
| place (beyond just looking at the well-known erroneous
| ImageNet dataset, for example) and whether those errors
| affect benchmarks. The answer is yes, if the noise
| prevalence is fairly high (above 10%). The takeaway is that
| you must benchmark models on clean test sets, regardless of
| noise in your training set, if you want to know how your
| model will perform in the real world.
| supernova87a wrote:
| Is the April Fool's joke of this article (dated April 1) the fact
| that the error rate is 5.8% for ImageNet? Because last I heard,
| it wasn't exactly breaking news that any training dataset has
| some intrinsic error rate, or that some error rate means that the
| models are "flawed".
|
| Maybe I would find this more interesting if the question were,
| "how good does good enough have to be"?
| PeterisP wrote:
| If a good ImageNet model gets 88-90% accuracy on the test set
| (so, a 10-12% error rate), then 5.8% of the test set's "gold
| standard" labels turning out to be wrong is a very big deal: it
| would mean that roughly half of the measured errors aren't
| actually real, but an artifact of the original human labelers'
| mistakes.
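|
| Back-of-the-envelope version of that arithmetic (the numbers are
| the hypothetical ones above, not measured values):
|
|     measured_error = 0.11   # ~11% top-1 error on the noisy test set
|     label_error    = 0.058  # fraction of test labels that are wrong
|
|     # Worst case: every mislabelled test item is one the model got
|     # right, so up to this fraction of the "errors" are artifacts.
|     print(f"{label_error / measured_error:.0%}")   # -> 53%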
| iudqnolq wrote:
| The interesting point they make is that the error is non-
| random, and a significant part of why complex models are
| better than simple models is that they're better at guessing
| the wrong labels.
| tmabraham wrote:
| People are studying ways to train accurate models with noisy
| labels (see the paragraph in this article:
| https://tmabraham.github.io/blog/noisy_imagenette#Prior-rese...)
|
| Also, technically accuracy as a metric is robust to noise
| (https://arxiv.org/abs/2012.04193). That means that a model with
| the highest accuracy on a noisy dataset will likely be the best
| model on the clean dataset. So these noisy datasets can still be
| very useful for the development of deep learning models. In fact,
| if you look at the tradeoff between larger datasets with noisy
| labels vs. smaller datasets with clean labels (since good
| annotation is expensive!), the noisy large-scale dataset will
| probably be more useful.
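|
| A quick toy simulation of the symmetric-noise case (my own setup,
| not the cited paper's experiments) shows why the ranking tends to
| survive; biased, non-random noise is where it can break down:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     n, k, noise = 50_000, 10, 0.10
|     true = rng.integers(k, size=n)
|
|     def simulate(err):
|         # A model that is right with prob 1-err, else picks a
|         # uniformly random wrong class.
|         wrong = (true + rng.integers(1, k, size=n)) % k
|         return np.where(rng.random(n) < err, wrong, true)
|
|     model_a, model_b = simulate(0.10), simulate(0.12)  # A truly better
|     flipped = (true + rng.integers(1, k, size=n)) % k
|     noisy = np.where(rng.random(n) < noise, flipped, true)
|
|     print((model_a == true).mean(),  (model_b == true).mean())
|     print((model_a == noisy).mean(), (model_b == noisy).mean())
|     # A still scores above B on the noisy labels.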
| IshKebab wrote:
| > That means that a model with the highest accuracy on a noisy
| dataset will likely be the best model on the clean dataset.
|
| That's exactly the opposite of what this article says.
| alex_hirner wrote:
| Exactly.
|
| The observation that a less faulty model is likely still more
| accurate on a noisy validation set than a more faulty model
| doesn't change the fact that there must be faulty models with
| higher accuracy than a perfect model on a noisy validation
| set.
| black_puppydog wrote:
| I just skimmed both sources you list and Ctrl-F "bias" seems to
| indicate they assume unbiased noise.
|
| Which _might_ be okay depending on the application, but for
| many data collection processes it will not be. Imagine if one
| group of people's wealth is systematically under-estimated
| (aka "measured") when training an insurance policy AI. The
| algorithm will then correctly learn the bias in the dataset.
|
| None of this is magic. Data collection is hard. Spotting your
| own biases before they make it into your dataset is a blind-
| spot exercise.
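|
| A toy version of the wealth example (all numbers invented):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     n = 10_000
|     group = rng.integers(2, size=n)
|     true_wealth = rng.normal(50, 10, n)
|
|     # Group 1's wealth is systematically recorded 20% too low,
|     # plus some harmless unbiased noise on top.
|     recorded = np.where(group == 1, 0.8 * true_wealth, true_wealth)
|     recorded = recorded + rng.normal(0, 1, n)
|
|     # Stand-in for any regressor: per-group mean of recorded values.
|     for g in (0, 1):
|         print(g, true_wealth[group == g].mean(),
|                  recorded[group == g].mean())
|     # The fitted value for group 1 reproduces the 20% bias.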
| skrebbel wrote:
| I'm not trying to refute the problem of biased AI data, but I
| do want to understand it better.
|
| You mention an "insurance policy AI". The rest of this thread
| seems to be about image recognition. What's an insurance
| policy AI and how does it work? Is it a thing or a
| hypothetical future thing?
|
| Can bias have equally bad effects in image
| recognition? I know about the story of the photo software
| that categorized a man's holiday photos under "gorillas"
| because it was trained only on white people. This is terribly
| insulting, but less terrible than an insurance company
| unfairly overcharging you or refusing you service.
|
| I guess what I'm saying is that I can't come up with examples
| of biased image recognition AIs having unfair outcomes that
| actually affect people deeply. I'm likely missing lots of
| terrible examples, and I'd love it if someone could educate me.
|
| EDIT: before people erupt in outrage, I'm not saying that a
| computer telling you that you're a gorilla isn't terrible,
| but in the end it's an indictment of the software, not you.
| whynaut wrote:
| If that software can't distinguish a black man from a
| gorilla, I'm sure it'll have trouble distinguishing two
| black men at least some of the time.
| lanstin wrote:
| Pulling people aside for extra checks in a security screening
| use case (getting an emotional state rating from an image). Not
| hitting pedestrians in a self-driving car. Also, a lot of
| image scanning applications are being sold as "upload a
| photo of a prospective employee and get a trustworthiness
| rating".
| 3pt14159 wrote:
| This is akin to how humans learn too, so I'm not surprised that
| it is the case. Little kids learn how to listen to verbal
| commands even when, say, a plane is flying overhead. Most of
| what we learn is in a bit of a mush of data, and I think this
| is a feature not a bug, since it allows the models to be more
| robust.
|
| It may be one of the precursors to intelligence: The ability to
| cognitively filter out unnecessary sensory stimuli in order to
| focus perception on the object or subject of a creature's
| attention.
| [deleted]
| [deleted]
| rsp1984 wrote:
| In classical estimation tasks (state estimation, control, SLAM,
| etc.) techniques for dealing with outlier data are well
| established (RANSAC and its siblings, M-estimators, etc.) and
| known as "robust estimation". Depending on how much processing
| you throw at it, these techniques can successfully deal with 90%
| corrupt datasets (10% inliers).
|
| For some reason though I've never heard of "robust ML" and I
| always wonder why. The only thing I hear about is increasing NN
| sizes and increasing model complexity. Many smart people are
| working on this so I suppose there's a reason for that but it's
| just not very obvious to me. Can someone with knowledge of the
| matter provide some insight?
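|
| To make "robust estimation" concrete, here is a minimal RANSAC-
| style sketch for line fitting (synthetic data and hand-picked
| thresholds, not from any real pipeline): fit random minimal
| samples, keep the hypothesis with the largest consensus set.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     x = rng.uniform(0, 10, 200)
|     y = 2 * x + 1 + rng.normal(0, 0.2, 200)   # inliers: y = 2x + 1
|     bad = rng.random(200) < 0.7               # ~70% gross outliers
|     y[bad] = rng.uniform(-20, 40, bad.sum())
|
|     best_count, best_model = 0, None
|     for _ in range(500):                      # random hypotheses
|         i, j = rng.choice(200, size=2, replace=False)
|         if x[i] == x[j]:
|             continue
|         slope = (y[j] - y[i]) / (x[j] - x[i]) # fit a minimal sample
|         intercept = y[i] - slope * x[i]
|         inliers = (np.abs(y - (slope * x + intercept)) < 0.5).sum()
|         if inliers > best_count:              # keep the best consensus
|             best_count, best_model = inliers, (slope, intercept)
|
|     print(best_model)   # close to (2, 1) despite the corruption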
| beforeolives wrote:
| Isn't what you're describing simply regularization?
| Jabbles wrote:
| I don't understand how this could have been unknown - surely
| every paper that presents an image recognition system has a small
| section looking at the cases the system got wrong?
|
| Edit: I said "surely", but, no: GoogLeNet's paper has no such
| section:
| https://static.googleusercontent.com/media/research.google.c...
| mjburgess wrote:
| It's hard to tell whether experimenters are delusional or
| dishonest; I think it's a mixture in practice.
|
| Researchers have no incentive to refute their own success; and
| neither do their peers who are producing similar "research"
| themselves.
|
| The whole game is to state enough metaphorical propositions
| until they seem concrete.
|
| "New AI interprets conversations and writes stories!" they
| say...
|
| Their evidence? Run it enough times until a subsample appears
| to match our expectations, and show that subsample.
|
| What about all the other counter-examples? Those get excused by
| comparison: "errors humans make too!!!!"
|
| When, of course, if you look at those counter-examples they
| completely destroy the interpretation of the system as
| understanding anything.
|
| It is ruled out by the data analysis procedure: measurement
| data is fed to algorithms which assume it is sampled randomly.
| The whole purpose of "intelligence" (understanding, etc.) is
| to uncover the non-random underlying model which _explains_ the
| data distribution. Thus statistical AI simply cannot do what
| is claimed of it. All we are left to do, for marketing and
| financial reasons, is play a game of metaphors and
| superstition.
|
| Personally I find it infuriating. It's profoundly
| religious/superstitious and a debasement of scientific
| practice. Of course, computer scientists aren't scientists, and
| here is where the problem arises: they have no empirical sense
| of what a model of the world _actually is_, nor of what mere
| statistics of measurement _can do_.
| tasogare wrote:
| Marketable names (buzzwords) are very powerful. If we replaced
| "AI" with more realistic terms such as "curve fitting",
| "hyperparameter tuning" or "decision tree building", the
| limitations would be very clear from the start, and the hype
| could be kept in check.
|
| A supervisor of mine once tried to make me switch to a more AI-
| based topic by telling me my research field was niche, and
| asking if "AI in education" wasn't something bigger and more
| interesting (I replied no). In retrospect I should have turned
| the question back on him, rephrased as proposed above.
| BlueTemplar wrote:
| I believe the term is "computer".
|
| Especially these days, when human computers have basically
| disappeared as a job.
| hntrader wrote:
| Delusional - absolutely. Such a large proportion fall into
| the trap of iteratively overfitting to the test set without
| much of a care for the basics such as data quality, what
| question to ask, etc. It's a problem.
| dalbasal wrote:
| This is meta but...
|
| Most politically interested people who aren't techies never
| really understood a lot of really important ideas from the
| "technology, power and freedom" section of the library. Consider
| these titles:
|
| - "Why Linux, Wikipedia & WWW are existence proof for
|   anarcho-communism/-capitalism"
| - "Youtube & Twitter are protocol squatters"
| - "RSS podcasting, the last free online media"
|
| These don't mean anything to most people, and it's hard to get
| them to care. "Net Neutrality" _did_ make its way to politics,
| but in a much bastardized form, and most _people_ never
| understood it... even reporters and such.
|
| OTOH, the politics and economics of _datasets_ are instinctively
| understood by non-tech people. The average person (certainly the
| average reporter) is highly in tune with the political importance
| of imagenet's categories for Bad Person. Ownership of the
| datasets. Control over contents. Etc. Often ahead of tech-
| oriented people.
|
| The person who struggled to understand "free as in speech"
| applied to software sounds like Stallman on Chivas when the
| conversation is about datasets. Interesting.
| jonnycomputer wrote:
| Related: https://news.ycombinator.com/item?id=26616454
|
| Data QA is just not sexy, and doesn't get you promotions, or the
| big paper.
| brudgers wrote:
| _Five human reviewers on Amazon Mechanical Turk_
|
| So long as AI best practice is pretending to pay, pretending
| to work will be the outcome.
|
| At best.
___________________________________________________________________
(page generated 2021-04-04 23:02 UTC)