[HN Gopher] Error-riddled data sets are warping our sense of how...
       ___________________________________________________________________
        
       Error-riddled data sets are warping our sense of how good AI is
        
       Author : djoldman
       Score  : 110 points
       Date   : 2021-04-02 13:49 UTC (2 days ago)
        
 (HTM) web link (www.technologyreview.com)
 (TXT) w3m dump (www.technologyreview.com)
        
       | Veedrac wrote:
       | This article doesn't make any major factual errors, but it
       | implies that this is some sort of surprise and that it's a
       | major impediment, neither of which holds up. While training on
       | noisy datasets is an interesting problem for many practical use-
       | cases, a few percent label errors in these classic basic datasets
       | (MNIST, CIFAR, ImageNet) really doesn't matter that much. The
       | first two are basically toys to check that your system works at
       | all at this point, and even on ImageNet the best models are now
       | more accurate than the original labels.
       | 
       | However, per [1], progress on current ImageNet still correlates
       | with true accuracy. This is in large part because we move to harder
       | tests when the easy ones stop working. In part it's also good,
       | because training with label noise forces the use of better-
       | generalizing solutions. The current SOTA is Meta Pseudo
       | Labels[2], which is a particularly clever trick that never even
       | directly exposes the final model to the training data.
       | 
       | [1] Are we done with ImageNet? --
       | https://arxiv.org/abs/2006.07159
       | 
       | [2] Meta Pseudo Labels -- https://arxiv.org/abs/2003.10580
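       | 
       | For intuition, here is a rough self-training sketch in scikit-
       | learn. It is a deliberate simplification: the actual Meta Pseudo
       | Labels method also feeds the student's performance on labeled
       | data back into the teacher, which is omitted here.
       | 
       |   from sklearn.datasets import load_digits
       |   from sklearn.linear_model import LogisticRegression
       |   from sklearn.model_selection import train_test_split
       |   
       |   X, y = load_digits(return_X_y=True)
       |   X_lab, X_unlab, y_lab, _ = train_test_split(
       |       X, y, train_size=0.1, random_state=0)
       |   
       |   # Teacher: trained on the small labeled set.
       |   teacher = LogisticRegression(max_iter=2000).fit(X_lab, y_lab)
       |   
       |   # Pseudo-label the unlabeled pool with the teacher's predictions.
       |   pseudo_labels = teacher.predict(X_unlab)
       |   
       |   # Student: sees the unlabeled images and the teacher's pseudo-
       |   # labels, never the ground-truth labels of that pool.
       |   student = LogisticRegression(max_iter=2000).fit(X_unlab, pseudo_labels)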
        
         | currio wrote:
         | I think another result from the article (and the paper it
         | links to) is that with more accurate data, simpler models
         | perform better than complicated models.
         | 
         | This is a good observation, as the sayings go: "garbage in,
         | garbage out", or "a model is only as good as its data". It is
         | better to focus on better data than to spend time improving
         | models or overfitting complicated models to inaccurate data.
        
           | Guest42 wrote:
           | Hadn't thought of it, but in my experience the "newer" models
           | require a lot more data; this would then become an impediment
           | because there's a certain likelihood that quality goes down
           | with the increased quantity, i.e. data measured differently or
           | under different conditions.
        
         | iudqnolq wrote:
         | The article alleges that the error is non-random. There are
         | some specific sorts of errors the humans on mechanical turk
         | made, and the SOTA models get a significant portion of their
         | advantage over earlier models based only on being better at
         | guessing the incorrect labels. To a layperson, this sounds like
         | a different issue to the one you refute. Is this not a big
         | deal?
        
           | Veedrac wrote:
           | Label noise will almost always give you a weaker model, and
           | sometimes it matters whether those are merely random or due
           | to some general confusion (eg. people mistaking crabs for
           | lobsters). But these datasets are best thought of as research
           | tools, not generally directly useful for applied models.
           | 
           | Whereas there are major practical issues with even small or
           | subtle errors in, eg., medical or legal datasets for real-
           | world models, a pure research dataset is generally fine as
           | long as a higher score correlates with a smarter model. If
           | MNIST, a nowadays-trivial handwritten digit recognition
           | dataset, had all its '1's labels swapped with '3's labels, it
           | would pretty much fill its role just as well.
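           | 
           | A toy way to see that, using sklearn's small digits set as a
           | stand-in for MNIST: a consistent label permutation applied to
           | both train and test labels leaves the measured accuracy
           | essentially unchanged, so the benchmark still ranks models
           | the same way.
           | 
           |   import numpy as np
           |   from sklearn.datasets import load_digits
           |   from sklearn.linear_model import LogisticRegression
           |   from sklearn.model_selection import train_test_split
           |   
           |   X, y = load_digits(return_X_y=True)
           |   swap = {1: 3, 3: 1}               # swap the '1' and '3' labels
           |   y_swapped = np.array([swap.get(label, label) for label in y])
           |   
           |   for labels in (y, y_swapped):
           |       X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
           |       model = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
           |       print(model.score(X_te, y_te))   # same score either way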
        
         | SpicyLemonZest wrote:
         | It's good to hear that experts have strategies to handle the
         | problem, but I think this will definitely be a surprise to most
         | of the article's readers. I'm a relatively AI-interested
         | layman, and I had assumed that these kinds of datasets had more
         | like 99% accuracy.
        
           | Veedrac wrote:
           | Heh, I guess this is a difference in perspectives. ML has
           | some really shoddy datasets. I mean really bad, worse than
           | you'd think.
           | 
           | > The macro-averaged results show that the ratio of correct
           | samples ("C") ranges from 24% to 87%, with a large variance
           | across the five audited datasets. Particularly severe
           | problems were found in CCAligned and WikiMatrix, with 44 of
           | the 65 languages that we audited for CCAligned containing
           | under 50% correct sentences, and 19 of the 20 in WikiMatrix.
           | In total, 15 of the 205 language specific samples (7.3%)
           | contained not a single correct sentence.
           | 
           | Quality at a Glance: An Audit of Web-Crawled Multilingual
           | Datasets -- https://arxiv.org/abs/2103.12028
           | 
           | In the face of stuff like that, a 90+% correct dataset
           | doesn't seem like a big deal.
        
       | bramblerose wrote:
       | https://labelerrors.com/ has an overview of the label errors
       | uncovered in this research.
        
         | IshKebab wrote:
         | Most of the errors seem to be either:
         | 
         | * The label is for something in the image, but there are
         | multiple things in the image.
         | 
         | * Getting animal species wrong.
        
         | Jon_Lowtek wrote:
         | That error correction has errors. Their consensus claims a
         | label "chain link fence" is wrong on an image that includes a
          | chain link fence, albeit as a lesser element of the image's
          | composition.
         | 
         | https://labelerrors.com/static/imagenet/val/n03000134/ILSVRC...
         | 
         | It is also interesting that their model guessed "container
         | ship" with the dominant element of the picture being a harbor
         | crane. I think there might be a lot of images in the ImageNet
         | data set that include both a crane and a ship as strong
         | elements, but only have the ship label.
         | 
         | ImageNet is a single label set, not a multi label set and that
         | is a known problem. This paper talks more about it:
         | https://github.com/naver-ai/relabel_imagenet
         | 
         | With large data sets containing millions of images, the only
         | way forward seems to be having AI keep refining the data,
         | escalating items that multiple automated labelers disagree on
         | for review by more classifiers, including humans.
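         | 
         | A minimal sketch of that disagreement-based escalation step;
         | the function and the unanimity criterion below are just
         | illustrative, not from the linked paper.
         | 
         |   import numpy as np
         |   
         |   def escalate_for_review(prob_sets):
         |       """prob_sets: list of (n_samples, n_classes) predicted-probability
         |       arrays, one per classifier. Returns indices of samples whose
         |       argmax labels are not unanimous, i.e. candidates for review."""
         |       votes = np.stack([p.argmax(axis=1) for p in prob_sets])
         |       n_models, n_samples = votes.shape
         |       agree = np.array([np.bincount(votes[:, i]).max() / n_models
         |                         for i in range(n_samples)])
         |       return np.where(agree < 1.0)[0]
         |   
         |   # e.g. three (hypothetical) models scoring 5 samples over 4 classes:
         |   rng = np.random.default_rng(0)
         |   probs = [rng.dirichlet(np.ones(4), size=5) for _ in range(3)]
         |   print(escalate_for_review(probs))   # indices to escalate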
        
           | BlueTemplar wrote:
           | So, do they have a higher or lower % of errors than the sets
           | that they are studying? :p
        
       | macspoofing wrote:
       | Wasn't the entire point of machine learning to sift through noisy
       | data and derive useful models or information? Why is this a
       | surprise?
        
         | beforeolives wrote:
         | You need correct labels to train these models. The point of the
         | article is that some of the most popular benchmark datasets
         | have a significant proportion of incorrect labels.
        
           | sdenton4 wrote:
           | Actually, it's ok to have lots of incorrect training labels;
           | the effect is usually just to shift the output probabilities
           | downward.
           | 
           | Where you really want/need accurate labels is the eval set,
           | which the article addresses.
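           | 
           | A quick way to see this on a toy problem (sklearn's digits
           | set; the 20% noise rate is arbitrary): randomize a chunk of
           | the *training* labels and compare clean-test accuracy with
           | mean predicted confidence. Accuracy usually degrades
           | gracefully; it's the confidences that shrink.
           | 
           |   import numpy as np
           |   from sklearn.datasets import load_digits
           |   from sklearn.linear_model import LogisticRegression
           |   from sklearn.model_selection import train_test_split
           |   
           |   X, y = load_digits(return_X_y=True)
           |   X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
           |   
           |   rng = np.random.default_rng(0)
           |   noisy = y_tr.copy()
           |   flip = rng.random(len(noisy)) < 0.2   # randomize 20% of train labels
           |   noisy[flip] = rng.integers(0, 10, size=flip.sum())
           |   
           |   for name, labels in (("clean", y_tr), ("noisy", noisy)):
           |       m = LogisticRegression(max_iter=2000).fit(X_tr, labels)
           |       acc = m.score(X_te, y_te)                  # clean test set
           |       conf = m.predict_proba(X_te).max(axis=1).mean()
           |       print(name, round(acc, 3), round(conf, 3))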
        
             | cgn wrote:
             | I am an author on the paper, and just saw this thread.
             | Thanks for all of your comments. Note this work focuses on
             | errors in test sets, not training labels. Test sets are the
             | foundation of benchmarking for machine learning. We study
             | test sets (versus training sets) because training sets are
             | already studied in depth in confident learning
             | (https://arxiv.org/pdf/1911.00068.pdf), and because before
             | this paper no one had studied, at scale and across many
             | datasets, whether these test sets had errors in the first
             | place (beyond just looking at the well-known erroneous
             | ImageNet dataset, for example) and whether those errors
             | affect benchmarks. The answer is yes, if the noise
             | prevalence is fairly high (above 10%). The takeaway is that
             | you must benchmark models on clean test sets, regardless of
             | noise in your training set, if you want to know how your
             | model will perform in the real world.
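             | 
             | For readers curious what the flagging step looks like, here
             | is a much-simplified sketch of the confident-learning idea:
             | estimate a per-class confidence threshold from out-of-sample
             | predicted probabilities, then flag examples whose given
             | label conflicts with a class predicted above that class's
             | threshold. The real method (and the cleanlab library)
             | refines this with a joint-distribution estimate and pruning
             | rules, so treat this only as intuition.
             | 
             |   import numpy as np
             |   
             |   def flag_label_issues(pred_probs, given_labels):
             |       """pred_probs: (n, K) out-of-sample predicted probabilities.
             |       given_labels: (n,) observed labels. Returns indices of
             |       examples whose given label looks inconsistent."""
             |       n, K = pred_probs.shape
             |       # t[j]: average self-confidence of examples labeled j
             |       t = np.array([pred_probs[given_labels == j, j].mean()
             |                     for j in range(K)])
             |       flagged = []
             |       for i in range(n):
             |           confident = np.where(pred_probs[i] >= t)[0]
             |           if len(confident) and given_labels[i] not in confident:
             |               flagged.append(i)
             |       return np.array(flagged)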
        
       | supernova87a wrote:
       | Is the April Fool's joke of this article (dated April 1) the fact
       | that the error rate is 5.8% for ImageNet? Because last I heard,
       | it wasn't exactly breaking news that any training dataset has
       | some intrinsic error rate, or that some error rate means that the
       | models are "flawed".
       | 
       | Maybe I would find this more interesting if the question were,
       | "how good does good enough have to be"?
        
         | PeterisP wrote:
         | If a good ImageNet model gets 88-90% accuracy on the test set
         | (so, a 10-12% error rate), then finding out that 5.8% of the
         | test set's "gold standard" labels are actually wrong is a very
         | big deal: it would mean that up to half of the measured errors
         | aren't actually real but are artifacts of the original human
         | labelers' mistakes.
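         | 
         | A rough worst-case version of that arithmetic (assuming every
         | mislabeled test item gets scored against the model; the 11% is
         | just a representative error rate):
         | 
         |   reported_error = 0.11   # a strong model's measured top-1 error
         |   label_error = 0.058     # estimated fraction of wrong test labels
         |   print(label_error / reported_error)   # ~0.53: about half the
         |                                         # measured error could be
         |                                         # label artifacts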
        
         | iudqnolq wrote:
         | The interesting point they make is that the error is non-
         | random, and a significant part of why complex models are better
         | than simple models is that they're better at guessing the wrong
         | labels.
        
       | tmabraham wrote:
       | People are studying ways to train accurate models with noisy
       | labels (see the paragraph in this article:
       | https://tmabraham.github.io/blog/noisy_imagenette#Prior-rese...)
       | 
       | Also, technically accuracy as a metric is robust to noise
       | (https://arxiv.org/abs/2012.04193). That means that a model with
       | the highest accuracy on a noisy dataset will likely be the best
       | model on the clean dataset. So these noisy datasets can still be
       | very useful for the development of deep learning models. In fact,
       | if you look at the tradeoffs between getting larger datasets that
       | have noisy labels vs. smaller datasets with clean labels (since
       | good annotation is expensive!), the noisy large-scale dataset
       | will probably be more useful.
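       | 
       | A toy check of the robustness claim under *uniform* label noise
       | (the p and K values below are just illustrative): the expected
       | accuracy measured on a noisily-labeled set is a monotone function
       | of the true accuracy, so rankings are preserved in expectation.
       | The article's point is that real benchmark noise is not uniform,
       | which is exactly when this can break down.
       | 
       |   def expected_noisy_accuracy(true_acc, p, K):
       |       # a correct prediction still matches the observed label if
       |       # that label wasn't flipped (prob 1-p); a wrong prediction
       |       # "luckily" matches a flipped label with prob p/(K-1)
       |       return true_acc * (1 - p) + (1 - true_acc) * p / (K - 1)
       |   
       |   for true_acc in (0.70, 0.80, 0.90):
       |       print(true_acc, expected_noisy_accuracy(true_acc, p=0.058, K=1000))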
        
         | IshKebab wrote:
         | > That means that a model with the highest accuracy on a noisy
         | dataset will likely be the best model on the clean dataset.
         | 
         | That's exactly the opposite of what this article says.
        
           | alex_hirner wrote:
           | Exactly.
           | 
           | The observation that a less faulty model is likely less
           | accurate on a noisy validation set than a more faulty model
           | doesn't change the fact that there must be faulty models with
           | higher accuracy than a perfect model on a noisy validation
           | set.
        
         | black_puppydog wrote:
         | I just skimmed both sources you list and Ctrl-F "bias" seems to
         | indicate they assume unbiased noise.
         | 
         | Which _might_ be okay depending on the application, but for
         | many data collection processes will not be okay. Imagine if one
         | group of people's wealth is systematically under-estimated
         | (aka "measured") when training an insurance policy AI. The
         | algorithm will then correctly learn the bias in the dataset.
         | 
         | None of this is magic. Data collection is hard. Spotting your
         | own biases before they make it into your dataset is a blind-
         | spot exercise.
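         | 
         | A toy illustration of that kind of *biased* (non-random) noise,
         | with entirely made-up numbers: one group's target is
         | systematically under-recorded, and the fitted model faithfully
         | reproduces the bias.
         | 
         |   import numpy as np
         |   from sklearn.linear_model import LinearRegression
         |   
         |   rng = np.random.default_rng(0)
         |   n = 10_000
         |   group = rng.integers(0, 2, size=n)           # 0 or 1
         |   income = rng.normal(50_000, 10_000, size=n)
         |   
         |   true_wealth = 5 * income + rng.normal(0, 5_000, size=n)
         |   recorded = true_wealth - 20_000 * group      # group 1 under-measured
         |   
         |   X = np.column_stack([income, group])
         |   model = LinearRegression().fit(X, recorded)
         |   print(model.coef_)   # the `group` coefficient comes out near -20000:
         |                        # the model has "correctly" learned the bias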
        
           | skrebbel wrote:
           | I'm not trying to refute the problem of biased AI data, but I
           | do want to understand it better.
           | 
           | You mention an "insurance policy AI". The rest of this thread
           | seems to be about image recognition. What's an insurance
           | policy AI and how does it work? Is it a thing or a
           | hypothetical future thing?
           | 
           | Can bias have equally bad effects in image recognition? I
           | know about the story of the photo software
           | that categorized a man's holiday photos under "gorillas"
           | because it was trained only on white people. This is terribly
           | insulting, but less terrible than an insurance company
           | unfairly overcharging you or refusing you service.
           | 
           | I guess what I'm saying is I can't come up with examples of
           | biased image recognition AIs having unfair outcomes that
           | actually affect people deeply. I'm likely missing lots of
           | terrible examples, and I'd love it if someone could educate
           | me.
           | 
           | EDIT: before people erupt in outrage, I'm not saying that a
           | computer telling you that you're a gorilla isn't terrible,
           | but in the end it's an indictment of the software, not you.
        
             | whynaut wrote:
             | If that software can't distinguish a black man from a
             | gorilla, I'm sure it'll have trouble distinguishing two
             | black men at least some of the time.
        
             | lanstin wrote:
             | Pulling people aside for extra checks in a security
             | screening use case (deriving an emotional-state rating from
             | an image). Not hitting pedestrians in a self-driving car.
             | Also, a lot of image scanning applications are being sold
             | as "upload a photo of a prospective employee and get a
             | trustworthiness rating".
        
         | 3pt14159 wrote:
         | This is akin to how humans learn too, so I'm not surprised that
         | it is the case. Little kids learn how to listen to verbal
         | commands even when, say, a plane is going by overhead. Most of
         | what we learn is in a bit of a mush of data, and I think this
         | is a feature not a bug, since it allows the models to be more
         | robust.
         | 
         | It may be one of the precursors to intelligence: The ability to
         | cognitively filter out unnecessary sensory stimuli in order to
         | focus perception on the object or subject of a creature's
         | attention.
        
         | [deleted]
        
       | [deleted]
        
       | rsp1984 wrote:
       | In classical estimation tasks (state estimation, control, SLAM,
       | etc.) techniques for dealing with outlier data are well
       | established (RANSAC and its siblings, M-estimators, etc.) and
       | known as "robust estimation". Depending on how much processing
       | you throw at it, these techniques can successfully deal with 90%
       | corrupt datasets (10% inliers).
       | 
       | For some reason though I've never heard of "robust ML" and I
       | always wonder why. The only thing I hear about is increasing NN
       | sizes and increasing model complexity. Many smart people are
       | working on this so I suppose there's a reason for that but it's
       | just not very obvious to me. Can someone with knowledge of the
       | matter provide some insight?
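       | 
       | For readers unfamiliar with the classical side, here is a minimal
       | sketch of the kind of robust estimation described above: a RANSAC
       | line fit that recovers the underlying line despite a majority of
       | grossly corrupted points (all constants are arbitrary).
       | 
       |   import numpy as np
       |   from sklearn.linear_model import LinearRegression, RANSACRegressor
       |   
       |   rng = np.random.default_rng(0)
       |   x = rng.uniform(0, 10, size=200)
       |   y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=200)   # inliers: y = 2x + 1
       |   outliers = rng.random(200) < 0.6                   # 60% gross outliers
       |   y[outliers] = rng.uniform(-50, 50, size=outliers.sum())
       |   
       |   X = x.reshape(-1, 1)
       |   print(LinearRegression().fit(X, y).coef_)          # biased by outliers
       |   ransac = RANSACRegressor(residual_threshold=0.5).fit(X, y)
       |   print(ransac.estimator_.coef_)                     # close to the true 2.0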
        
         | beforeolives wrote:
         | Isn't what you're describing simply regularization?
        
       | Jabbles wrote:
       | I don't understand how this could have been unknown - surely
       | every paper that presents an image recognition system has a small
       | section looking at the cases the system got wrong?
       | 
       | Edit: I said "surely", but, no: GoogleNet's paper has no such
       | section:
       | https://static.googleusercontent.com/media/research.google.c...
        
         | mjburgess wrote:
         | It's hard to tell whether experimenters are delusional or
         | dishonest; I think it's a mixture in practice.
         | 
         | Researchers have no incentive to refute their own success; and
         | neither do their peers who are producing similar "research"
         | themselves.
         | 
         | The whole game is to state enough metaphorical propositions
         | until they seem concrete.
         | 
         | "New AI interprets conversations and writes stories!" they
         | say...
         | 
         | Their evidence? Run it enough times until a subsample appears
         | to match our expectations, and show that subsample.
         | 
         | What about all the other counter-examples? Those get waved away
         | with "humans make errors too!!!!"
         | 
         | When, of course, if you look at those counter-examples they
         | completely destroy the interpretation of the system as
         | understanding anything.
         | 
         | It is ruled out by the data analysis procedure: measurement
         | data is fed to algorithms which assume it is sampled randomly.
         | The whole purpose of "intelligence" (, understanding, etc.) is
         | to uncover the non-random underlying model which _explains_ the
         | data distribution. Thus statistical AI simply cannot do what is
         | claimed of it. All we are left to do, for marketing and
         | financial reasons, is play a game of metaphors and
         | superstition.
         | 
         | Personally I find it infuriating. It's profoundly
         | religious/superstitious and a debasement of scientific
         | practice: of course, computer scientists aren't scientists, and
         | here is where the problem arises. They have no empirical sense
         | of what a model of the world _actually is_, nor of what "mere
         | statistics of measurement" _can do_.
        
           | tasogare wrote:
           | Marketable names (buzzwords) are very powerful. If we
           | replaced "AI" with more realistic terms such as "curve
           | fitting", "hyperparameter tuning" or "decision tree building",
           | the limitations would be very clear from the start, and the
           | hype could be kept in check.
           | 
           | A supervisor of mine once tried to make me switch to a more
           | AI-based topic by telling me my research field was niche, and
           | asking whether "AI in education" wasn't something bigger and
           | more interesting (I replied no). In retrospect I should have
           | asked him the question back, rephrased as proposed above.
        
             | BlueTemplar wrote:
             | I believe the term is "computer".
             | 
             | Especially these days, when human computers have basically
             | disappeared as a job.
        
           | hntrader wrote:
           | Delusional - absolutely. Such a large proportion fall into
           | the trap of iteratively overfitting to the test set without
           | much of a care for the basics such as data quality, what
           | question to ask, etc. It's a problem.
        
       | dalbasal wrote:
       | This is meta but...
       | 
       | Most politically interested people who aren't techies never
       | really understood a lot of really important ideas from the
       | "technology, power and freedom" section of the library. Consider
       | these titles:
       | 
       |   - "Why Linux, Wikipedia & WWW are existence proof for
       |     anarcho-communism/-capitalism"
       |   - "Youtube & Twitter are protocol squatters"
       |   - "RSS podcasting, the last free online media"
       | 
       | These don't mean anything to most people, and it's hard to get
       | them to care. "Net Neutrality" _did_ make its way to politics,
       | but in a much-bastardized form, and most _people_ never
       | understood it... even reporters and such.
       | 
       | OTOH, the political and economic weight of _datasets_ is
       | instinctively understood by non-tech people. The average person
       | (certainly the average reporter) is highly attuned to the
       | political importance of ImageNet's Categories for Bad Person,
       | ownership of the datasets, control over contents, etc. Often
       | ahead of tech-oriented people.
       | 
       | The person who struggled to understand "free as in speech" as
       | applied to software sounds like Stallman on Chivas when the
       | conversation is about datasets. Interesting.
        
       | jonnycomputer wrote:
       | Related: https://news.ycombinator.com/item?id=26616454
       | 
       | Data QA is just not sexy, and doesn't get you promotions, or the
       | big paper.
        
       | brudgers wrote:
       | _Five human reviewers on Amazon Mechanical Turk_
       | 
       | So long as AI best practice is pretending to pay, pretending to
       | work will be the outcome.
       | 
       | At best.
        
       ___________________________________________________________________
       (page generated 2021-04-04 23:02 UTC)