[HN Gopher] What Data Can't Do
___________________________________________________________________
What Data Can't Do
Author : RiderOfGiraffes
Score : 144 points
Date : 2021-03-31 12:45 UTC (10 hours ago)
(HTM) web link (www.newyorker.com)
(TXT) w3m dump (www.newyorker.com)
| temp8964 wrote:
| This article is totally gibberish. It's a terrible mixture of
| many unrelated things. Just because those things all have
| something to do with data (anything can be presented in numeric
| form), it does not mean their issues are about data.
|
| First, the Tony Blair example is not about data. It is a failure
| of government planning. It's bad politics and bad economics.
|
| The G.D.P. example is laughable. G.D.P. is never intended to be
| used to compare individual cases. What kind of nonsense is this?
|
| And the IQ example. The results are backed by decades of
| extensive studies. The author thinks picking a few critics can
| invalidate the whole field. And look! The white supremacist who
| gave Asians the highest IQ, what a disgrace to his own ideology.
|
| Many more. I feel it's a kind of tactic for producing this kind
| of article: just glue together a bunch of stuff that seems
| vaguely related, and bam, you've got an article.
| cloogshicer wrote:
| Gotta disagree here. This article is acknowledging a pattern,
| that data is misused in many different areas.
|
| I think the problem goes even deeper, which is a
| misunderstanding of the scientific method. Good discussion
| about this topic here:
| https://news.ycombinator.com/item?id=26122712
| temp8964 wrote:
| "data is misused in many different areas" is not a valuable /
| informative point.
|
| There are many wrongs that seem to have something to do with
| data, but in fact they do not.
|
| For example, socialist economic planning will eventually fail,
| and then you would say they misused data. It seems relevant, but
| misusing the data is not the real cause of their failure at
| all.
| not2b wrote:
| Replace "socialist" with "large company". The companies
| gather data, establish metrics, and manage to those
| numbers, and often bad things result. Ever been in a
| company where some internal support function goes to hell
| because its top manager's bonus depends on a metric, and
| they can improve that metric by refusing to support the
| users (find excuses to close IT support calls without
| fixing the issue, etc.)?
| temp8964 wrote:
| Yes. But then some would say the company failed because it
| "misused data", when that was not the real cause of the
| company's failure. Any project that involves using data could
| blame its failure on "misused data", which is a useless
| conclusion.
| tppiotrowski wrote:
| "once a useful number becomes a measure of success, it ceases to
| be a useful number"
|
| Two other unintended consequences of incentives I learned in
| economics:
|
| 1. Increasing fuel efficiency does not reduce gas consumption.
| People just use their car more often.
|
| 2. Asking people to pay-per-bag for garbage pickup resulted in
| people dumping trash on the outskirts of town.
|
| _Edit: Did more research after downvote. Definitely double check
| things you learn in college_
|
| 1. The jury is still out:
| https://en.wikipedia.org/wiki/Jevons_paradox
|
| 2. Seems false
| https://en.wikipedia.org/wiki/Pay_as_you_throw#Diversion_eff...
| kerblang wrote:
| They might as well have included the granddaddy example (as the
| age of computing goes): the Vietnam War, McNamara & body counts.
| neonate wrote:
| https://archive.is/ynOm2
| sradman wrote:
| From the ungated archive [1]:
|
| > Whenever you try to force the real world to do something that
| can be counted, unintended consequences abound. That's the
| subject of two new books about data and statistics: "Counting:
| How We Use Numbers to Decide What Matters", by Deborah Stone,
| which warns of the risks of relying too heavily on numbers, and
| "The Data Detective", by Tim Harford, which shows ways of
| avoiding the pitfalls of a world driven by data.
|
| Data is a powerful feedback mechanism that can enable system
| gamification; it can also expose it. The evil is extracting
| unearned value from a system through gamification, not the tools
| employed to do so. I'm looking forward to reading both books.
|
| [1] https://archive.is/ynOm2
| kazinator wrote:
| > _doctors would be given a financial incentive to see patients
| within forty-eight hours._
|
| Not measuring that from the first contact that the patient made
| is simply dishonest.
|
| "Call back in three days to make the appointment, so I can claim
| you were seen within 48 hours, and therefore collect a bonus"
| amounts to fraud because the transaction for obtaining that
| appointment has already been initiated.
|
| I mean, they could as well just give the person the appointment
| in a secret, private appointment registry, and then copy the
| appointments from that registry into the public one in such a way
| that it appears most of the appointments are being made within
| the 48 hour window. Nothing changes, other than that bonuses are
| being fraudulently collected, but at least the doctor's office
| isn't being a dick to the patients.
| Eridrus wrote:
| It's really hard to design a metric that can't be gamed. The
| problem here seems to be that doctors are under-provisioned for
| some reason, so long wait times are a form of load shedding for
| the system. Individual clinics have little control over that
| core issue, because they are generally boxed in by regulations
| over who can administer medical care. Beyond rushing
| appointments (which they're probably already doing), there's not
| much they can do to solve the problem, so all that's left is to
| game the rules or forgo the bonuses.
| closeparen wrote:
| Doctors don't get to bill for idle time. Being less than
| fully utilized is leaving money on the table. The idea here
| is presumably to compensate them for leaving gaps in their
| schedules.
| [deleted]
| DharmaPolice wrote:
| It may amount to fraud but in the context of that service, no
| record was kept of calls not resulting in an appointment. My
| wife was a receptionist at a GP's surgery around the time the
| article mentions, and in some cases it was worse than that - if
| you phoned and asked for an appointment and they couldn't give
| you one within 48 hours, they wouldn't offer one at all - they'd
| tell you to call back later / the next day.
|
| Although the New Yorker piece has leaned on the bonus angle, the
| way it was discussed publicly was that doctors weren't allowed
| to offer you appointments outside the 48-hour window [0].
|
| It was a very silly interpretation of the rules, but I think
| GPs felt it was too rigid and therefore stuck to the letter
| rather than the spirit.
|
| 0 - http://news.bbc.co.uk/1/hi/health/3682920.stm
| ectopod wrote:
| Not to defend the system, but I don't think they were trying to
| fraudulently get her 48 hour payment. Rather, if they accepted
| advance bookings (which was, after all, the old system) then
| almost nobody would be seen in 48 hours. They could have
| offered advance bookings for follow-ups, but since most
| appointments are taken by the sickest people, many of them will
| be follow-ups, so this probably wouldn't have helped.
|
| If you want to reduce the queuing time in a system you need to
| reduce the processing time (i.e. the duration of an
| appointment) or increase the number of servers (i.e. doctors).
| You can't do it by edict.
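| A minimal sketch of that arithmetic, using a single-server
| queueing approximation (illustrative numbers only, not NHS
| data):
|
|     # Average time in the system for an M/M/1 queue: W = 1/(mu - lam)
|     # mu = service rate (appointments/day), lam = arrival rate
|     def avg_wait_days(lam: float, mu: float) -> float:
|         assert lam < mu, "if arrivals exceed capacity the queue grows forever"
|         return 1.0 / (mu - lam)
|
|     print(avg_wait_days(lam=28.0, mu=30.0))  # 0.5 days
|     print(avg_wait_days(lam=29.5, mu=30.0))  # 2.0 days: tiny overload, big wait
|
| Only raising mu (shorter appointments) or adding servers (more
| doctors) moves those numbers; booking rules and targets don't.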
| bluGill wrote:
| In hindsight, sure. But did you think of that before hearing of
| the problem? Even if you did, can you think through - without
| much time to think - every possible metric I could propose on
| every possible topic?
|
| Tony Blair was trying to solve a real problem that needed
| solving. That he opened a different problem is something that
| we should think of as normal, and not blame him for trying to
| solve the original problem. The question should be how do we
| change the metric until the unintended consequences are ones we
| can live with. That will probably take more than a lifetime to
| work out.
|
| Note that there will be a lot of debate. There are predicted
| consequences that don't happen in the real world for whatever
| reason. There are consequences that some feel we can live with
| that others will not accept. Politics is messy.
| wodenokoto wrote:
| The article seems to stop pretty early, as if something is
| missing.
|
| It's an anecdote about a government incentive to have doctors see
| patients within 48 hours causing doctors to refuse scheduling
| patients later than 48 hours in order to get the incentive bonus.
|
| This is not an example of limits of data, but an example of
| perverse incentives.
| justinclift wrote:
| > It's an anecdote about a government incentive to have doctors
| see patients within 48 hours causing doctors to refuse
| scheduling patients later than 48 hours in order to get the
| incentive bonus.
|
| That part is probably just the first 1/8th or so of the article
| (rough guess). Sounds like it was cut short for you?
| polytely wrote:
| Did you use reader mode? because I noticed that when I use
| firefox's reader mode it will cut off part of the article on
| the New Yorker site.
| strathmeyer wrote:
| We clicked on the link and were presented with two
| paragraphs. New Yorker articles usually don't show up
| correctly; I don't know why they're allowed to be posted here.
| Acting like we don't know how to read a webpage is
| gaslighting.
| polytely wrote:
| Well that wasn't my intention at all, just a problem I
| encountered a few hours before I read your comment, so it
| was fresh in my mind.
| Fishkins wrote:
| When I first opened the article in Firefox, I only saw the
| first two paragraphs (same as the OP). This was true whether
| it was in reader mode or not. I opened it in Chrome and saw
| the whole article.
|
| I just tried opening it in Firefox now (a couple hours later)
| and I see the whole article. If I switch to reader mode I do
| see it's truncated about halfway through, but I think that's
| a separate issue from what the OP was seeing.
| wodenokoto wrote:
| On iOS. Initially used reader mode, but switched because it
| seemed cut off.
|
| But also without reader mode I can't see more than the NHS
| anecdote.
| jabroni_salad wrote:
| When a measure becomes a target, it ceases to be a good
| measure.
|
| This case could be said to be creating misleading data. If the
| doctor's offices aren't recording appointments more than 48
| hours in advance, the System is losing visibility on the total
| number of people who want appointments. Every office will
| appear to be 100% efficient even though there is effectively
| still an invisible waiting list.
| [deleted]
| jplr8922 wrote:
| Confusing performance metrics and strategical objective is not a
| data problem, it is a human problem. It happens to a lot of
| people outside the usual Blair-WhiteNationalist-IQ crowd. I do
| not think that advanced technical knowledge in ML or stats is
| required to avoid this mistake ; it is the ability to perform
| valid counterfactuals statements.
|
| A good example of what I mean can be found on Wikipedia:
|
| His instinctive preference for offensive movement was typified by
| an answer Patton gave to war correspondents in a 1944 press
| conference. In response to a question on whether the Third Army's
| rapid offensive across France should be slowed to reduce the
| number of U.S. casualties, Patton replied, "Whenever you slow
| anything down, you waste human lives."[103]
|
| https://en.wikipedia.org/wiki/George_S._Patton
|
| Here, US general Patton is not confusing a performance metric
| (number of casualties) with the strategic goal (winning the war).
| His counterfactual statement could be that "if we slow things
| down, we are simply delaying future battles and increasing the
| _total_ number of casualties needed to achieve victory".
|
| I'm not surprised at Blair's decision. When we choose leaders, do
| we favor long-term strategic thinkers, or opportunistic pretty
| faces?
| lmm wrote:
| Does the use of statistics actually amplify misunderstanding, or
| merely reveal misunderstandings that were already there? In any
| of these examples given - predicting rearrests, infant mortality,
| or so on - it's hard to imagine that someone not using numbers
| would have reached a conclusion that was any closer to the truth.
|
| Data has its limits, but the solution is usually - maybe even
| always - more data, not less.
| mjburgess wrote:
| It's pretty trivial to predict things without "data". Data just
| means using some measurement system to obtain measurements of
| some target phenomenon. Many targets cannot be measured, or
| have not yet occurred and so cannot be measured.
|
| Reasoning counter-factually is trivial: What would happen if I
| dropped this object in this place in which an object, of this
| kind, has never been dropped before?
|
| Well apply relevant models, etc. and "the object falls, rolls,
| pivots, etc.".
|
| This is reasoning-forward _from_ models, rather than backwards
| from data. And it's the heart of anything that makes any
| sense.
|
| Data is not a model and provides no model. The "statistics of
| mere measurement" is a dangerously utopian misunderstanding of
| what data is. The world does not tell you, via measurement,
| what it is like.
| kkoncevicius wrote:
| This article is not so much about the data, as it is about rules
| and thresholds used to divide that data into groups.
| mdip wrote:
| Kind of love the initial story in the article about 48-hour wait
| times.
|
| I had a stint writing conferencing software for quite some time,
| and every once in a while we'd come across a customer requirement
| with capabilities that, to us developers, obviously "would be
| misused". As a result, we did the "Thinking, Fast and Slow"
| pre-mortem to help surface other ways that the system could be
| attacked (along with what we would do to prevent it and how it
| impacted the original feature).
|
| If you create something, and open it to the public, and there's
| _any way_ for someone to misuse it for financial incentive
| (especially if they can do so without consequence), it _will_ be
| misused. In fact, depending on the incentive, you may find that
| the misuse becomes the only way that the service is used.
| kazinator wrote:
| When the doctor's office is inundated with patient visits, you
| cannot fix scheduling backlogs by fiddling with the scheduling
| algorithm, no matter how much data you have.
|
| Say the calendar is initially empty and 1000 people want to see
| the doc, right now. You can fill them all into the calendar, or
| you can play games that solve nothing, like only filling
| tomorrow's schedule with 10 people, asking 990 of them to call
| back. That doesn't change the fact that it takes 100 days to
| see 1000 patients. All it does is cause unfair delays; the
| original 1000 can be pre-empted by newcomers who get earlier
| appointments since their place in line is not being maintained.
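| The arithmetic, spelled out (with the hypothetical numbers from
| the example above):
|
|     patients = 1000
|     slots_per_day = 10
|     print(patients / slots_per_day)  # 100 days to clear the backlog
|     # Booking everyone up front or telling 990 people to call back
|     # tomorrow changes who waits, not how many days it takes to
|     # see all 1000.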
| strathmeyer wrote:
| How do we read more than the initial story? Do we have to pay
| to read it? There is no indication on the webpage that there are
| more than two paragraphs, other than the advertisement for the
| author's book.
| lifeisstillgood wrote:
| conferencing as in 'ComicCon' or 'Zoom'? Can you give an
| example?
| mdip wrote:
| Haha, hadn't even thought of that -- Conferencing as in
| developing bespoke (and some white-label) software for
| organizations deploying Office Communications Server (and
| R2), Lync and ultimately Skype for Business (I do a little
| Teams work these days but I am focused on other areas,
| presently).
| foolinaround wrote:
| Around the paywall: https://archive.vn/ynOm2
| tomrod wrote:
| An absolutely fantastic article that captures my concerns as a
| user, purveyor, and automater of systems that help with numbers.
| I'm always very cautious regarding the jump from numbers
| informing to numbers deciding.
| visarga wrote:
| Data is very limited, indeed. We can't predict outside the
| distribution, or unrelated events (without a causal link), or
| random events in the future. We should be humble about the limits
| of data.
| _rpd wrote:
| Sure, but coding a human-equivalent response to such events is
| trivial (because no one and nothing responds well to such
| events).
| ppod wrote:
| This author has published a couple of articles like this at the
| New Yorker. They all have this in common: the author works through
| some interesting and in some ways unusual cases where data or
| statistics have been improperly or naively applied, with some
| social costs. I really enjoy the articles themselves.
|
| Then the New Yorker packages it up with a cartoon and a headline
| and subheadline like "Big Data: When will it eat our children?"
| or "Numbers: Do they even have souls?", and serves it up to their
| technophobic audience in a palatable way.
|
| https://www.newyorker.com/contributors/hannah-fry
| dsaavy wrote:
| > *Numbers don't lie, except when they do. Harford is right to
| say that statistics can be used to illuminate the world with
| clarity and precision. They can help remedy our human
| fallibilities. What's easy to forget is that statistics can
| amplify these fallibilities, too. As Stone reminds us, "To
| count well, we need humility to know what can't or shouldn't be
| counted."*
|
| I do have a problem with her conclusion here. Are numbers
| really lying if it's actually an incorrect data collection
| method or conflicting definitions of criteria for generation of
| certain numbers (like the example used in the second to last
| paragraph)? She seems to be pointing out a more important fact,
| which is that people don't question underlying data, how it was
| collected, and the choices those data collectors made when
| making a data set. People tend to take data and conclusions
| drawn from it as objective realities, when in reality data is
| way more subjective.
| lupire wrote:
| > Are numbers really lying if it's actually an incorrect data
| collection method or conflicting definitions of criteria for
| generation of certain numbers
|
| Obviously it's a figurative metaphor, but it's pretty clearly
| a case of "this supposedly objective factual calculation is
| presenting an untruth."
| not2b wrote:
| One of the points is that the act of collecting the numbers
| and making decisions based on them can change the underlying
| behavior. The numbers can be perfectly correct (how many
| cases does the IT department get? How long does it take on
| average to resolve the issue?). The goal can be correct (we
| want to get issues resolved faster). But as soon as you try
| to manage people based on those perfectly valid numbers, bad
| things often happen.
| quietbritishjim wrote:
| Hannah Fry is a mathematics communicator working in quite a few
| other media. She's been on a couple of BBC documentaries, and
| on a few videos on the Numberphile YouTube channel (which is
| also very good regardless of who's on it).
| canadianwriter wrote:
| Data always needs to be paired with empathy. ML/AI simply doesn't
| have empathy so it will always be missing a piece of the overall
| pie.
|
| Let AI crunch the numbers, but combine it with a human who can
| understand the "why" of things and you can really kick butt.
| gnulinux wrote:
| I agree with you, although, unfortunately, most -- if not all
| -- engineers I know would respond to this by complaining about
| how "a human who can understand the 'why'" cannot be automated.
| williesleg wrote:
| solve covid, not a vaccine but a treatment, those are forbidden.
| Ozzie_osman wrote:
| Data is not a substitute for good judgment, for empathy, for
| proper incentives.
|
| The article focuses on governments and bureaucracies but there's
| no better example than "data-driven" tech companies, as we A/B
| test our key engagement metrics all the way to soulless products
| (with, of course, a little machine learning thrown in to juice
| the metrics).
|
| I wrote about this before:
| https://somehowmanage.com/2020/08/23/data-is-not-a-substitut...
| gen220 wrote:
| I've written the same sentence before! This is so cool! Pardon
| the wall of text.
|
| Here's my thesis, curious to hear your thoughts.
|
| At some time around 2005, when efficient persistence and
| computation became cheap enough that any old f500-corp could
| afford to endlessly collect data forever, something happened.
|
| Before 2005, if a company needed to make a big corporate
| decision, there was _some_ data involved in making the
| decision, but it was obviously riddled with imperfections and
| aggregation biases.
|
| Before 2005, executives needed to be seasoned by experience, to
| develop this thing we call "Good Judgement", that allows them
| to make productive inferences from a paucity of data. The
| Corporate Hierarchy was a contest for who could make the best
| inferences.
|
| Post-2005, data collection is ubiquitous. Individuals and
| companies realized that you don't need to pay people with
| experience any more, you can simply collect better data, and
| outsource decision-making to interpretations of this data. The
| corporate hierarchy now is all about who can gather the "best"
| data, where "best" means growing the money pile by X% this
| quarter.
|
| "Good Judgement" used to be expected from the CEO, down to at
| least 1-3 levels of middle management above the front-line
| people. Now, it appears (to me) to be mostly a feature of the
| C-Suite and Boards, and it's disappeared elsewhere. Long-term,
| high-performing companies seem to have a more diffused sense of
| good judgement. But these are rare. Maybe they always have
| been?
|
| Anyways, as we agree, this has a tendency to lead in
| problematic directions. Here's my thesis on "why".
|
| Fundamentally, any "data" is reductive of human experience.
| It's like a photograph that captures a picture by excluding the
| rest of the world.
|
| Few people seem to understand this analogy, because they think
| photographs are the ultimate record of an event. Lawyers
| understand this analogy. With the right framing, angle,
| lighting (and of course, with photoshop), you can make a
| photograph tell any story you want.
|
| It's the same issue with data, arguably worse since we don't
| have a set of standard statistics. We have no GAAP-equivalent
| for data science (yet?).
|
| Our predecessors understood that data was unreliable, and
| compensated for this fact by selecting for "Good Judgement".
| The modern mega-corps demonstrate that we don't have a good
| understanding of this today, evidenced by religious "data-
| driven" doctrine, as you describe.
|
| People will say "hey! at least some data is better than no
| data!", to which I'll say data is useless and even harmful in
| the absence of capable interpreters. In 2021, we have an
| abundance of data, but a paucity of people who are capable of
| critical interpretation thereof.
|
| I don't know if it's a worse situation than we had 20 years
| ago. But it's definitely a different situation, that requires a
| new approach. I think people are taking notice of it, so I'm
| hopeful.
| cloogshicer wrote:
| Thank you for writing this, I enjoyed reading it and largely
| agree.
| mminer237 wrote:
| I think it's almost worse in tech because it largely works. If
| the government sets a flawed metric, their real goal of
| pleasing their constituents has failed and theoretically they
| either have to fix it or lose political support.
|
| But in tech, if your goal is just to make money, soullessly
| following data will often get you there, to the detriment of
| everyone else. Clickbait headlines will get you more views.
| Full-page popup ads will get you more ad clicks/newsletter
| subscriptions. Microtransactions will get you more sales.
| Gambling mechanics will get you more microtransactions.
|
| You can say it's a flawed metric, but I think in the end, most
| people just actually care more about making money than they do
| about building a good product.
| carlosf wrote:
| I am increasingly worried about people applying ML to everything
| without any rigour.
|
| Statistical inference generally only works well in very specific
| conditions:
|
| 1 - You know the distribution of the phenomenon under study (or
| make an explicit assumption and assume the risk of being wrong)
|
| 2 - Using (1), you calculate how much data you need so you get an
| estimation error below x%
|
| Even though most ML models are essentially statistics and have
| all the same limitations (issues with convergence, fat tailed
| distributions, etc...) it seems the industry standard is to
| pretend none of that exists and hope for the best.
|
| IMO the best moneymaking opportunities of the decade will involve
| exploiting unsecured IoT devices and naive ML models; we will
| have plenty of those.
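| To make point 2 concrete, a minimal sketch for the simplest case
| (estimating a proportion from independent draws; the formula
| assumes a well-behaved, not fat-tailed, distribution):
|
|     import math
|
|     def sample_size(margin: float, z: float = 1.96, p: float = 0.5) -> int:
|         # Approximate n needed to estimate a proportion to +/- margin
|         # at ~95% confidence (z = 1.96), worst case p = 0.5.
|         return math.ceil(z**2 * p * (1 - p) / margin**2)
|
|     print(sample_size(0.03))  # ~1068 samples for a 3% margin of error
|     print(sample_size(0.01))  # ~9604 samples for a 1% margin of error
|
| Nothing like this calculation happens in most ML pipelines.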
| currymj wrote:
| i think this actually gets at what makes applied ML distinct
| from statistics as a practice, even though there is a ton of
| overlap.
|
| statisticians make assumptions 1 and 2, and think of themselves
| as trying to find the "correct" parameters of their model.
|
| people doing applied ML typically assume they don't know 1
| (although they might implicitly make some weak assumptions like
| sub-gaussian to avoid fat tails, etc.) and also typically don't
| care about being able to do 2. and they don't care about their
| parameters; in a sense to an ML practitioner, every parameter
| is a nuisance parameter.
|
| instead you assume you have some reliable way of evaluating
| performance on the task you care about -- usually measuring
| performance on an unseen test set. as long as this is actually
| reliable, then things are fine.
|
| but you are right that in the face of a shifting distribution
| or an adversary crafting bad inputs, ML models can break down
| -- but there is actually a lot of research on ways to deal with
| this, which will hopefully reach industry sooner rather than
| later.
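| a minimal sketch of that evaluation workflow (assuming
| scikit-learn is available; any model and metric would do):
|
|     from sklearn.datasets import load_breast_cancer
|     from sklearn.ensemble import RandomForestClassifier
|     from sklearn.metrics import accuracy_score
|     from sklearn.model_selection import train_test_split
|
|     X, y = load_breast_cancer(return_X_y=True)
|     X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
|                                               random_state=0)
|     model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
|     # the held-out score is only meaningful if the test set really is
|     # "distributed like" the data the model will face in production
|     print(accuracy_score(y_te, model.predict(X_te)))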
| QuesnayJr wrote:
| "Every parameter is a nuisance parameter" is a great way to
| put it.
| RobinL wrote:
| Yes - this is pretty much exactly how I explain the
| difference between machine learning and statistics.
|
| Despite using similar models, the expertise required for
| 'doing statistics' (statistical inference) is actually very
| different from machine learning. Machine learning fits into
| the 'hacker mentality' well - try stuff out, see what works.
| To do statistical inference effectively, you really do need
| to spend time learning the theory. They both require deep
| skills - but the skills are surprisingly different
| considering it's often the same underlying model.
| nickforr wrote:
| But without some statistical knowledge, isn't there a risk
| of a lack of understanding about the robustness of "what
| works"?
| splithalf wrote:
| Statistical knowledge doesn't remove that risk. The
| extent to which it even lowers the risk is a question
| that could be answered empirically.
| RobinL wrote:
| yeah, agreed - a good understanding of the model's
| statistical assumptions can often help you make the model
| more robust and also give you ideas for what types of
| feature engineering are likely to work.
| carlosf wrote:
| I disagree; every ML model has some implicit statistical
| assumption, which is often not well understood by
| practitioners.
|
| At minimum you must assume your underlying process is not fat
| tailed. If it is, then your training/validation/test data
| might never be enough to make reliable predictions and your
| model might break constantly in prod.
|
| BTW shifting distributions and fat tailed distributions are
| sort of equivalent, at least mathematically.
| currymj wrote:
| I don't disagree with any of that, but I still think a
| responsible, clear-thinking ML practitioner can avoid
| having to assume the form of the data-generating process,
| depending on their application.
|
| In some cases if you care about PAC generalization bounds,
| it's even the case that the bounds do actually hold for all
| possible distributions.
| dumb1224 wrote:
| I think it's more meaningful to have the discussion in a
| specific problem domain since statistical inference or ML
| are just tools to better model a problem / phenomenon.
| The domain (prior) knowledge -- everything else that's
| not stats / ML -- is the key to building a more robust
| model. Leave the problem domain out and we are left with just
| pure mathematical theories, whose points can only be
| proved by simulated data.
| 6gvONxR4sf7o wrote:
| > instead you assume you have some reliable way of evaluating
| performance on the task you care about -- usually measuring
| performance on an unseen test set. as long as this is
| actually reliable, then things are fine.
|
| This is the part that often fails in practice. Think of all
| the benchmarks that show superhuman performance and compare
| that to how good those same models really aren't.
| Constructing a good set of holdouts to evaluate on is really
| hard and gets back to similar issues. In practice, doing what
| you're describing reliably (in a way that actually implies
| you should have confidence in your model once you roll it
| out) is rarely as simple as holding out some random bit of
| your dataset out and checking performance on it.
|
| On the other hand, what you often see is people just holding
| out a random bunch of rows.
| erichahn wrote:
| Isn't the point of ML exactly that you don't know the
| underlying distribution? How is this ever assumed in any way?
| ML is not parametric statistics.
| contravariant wrote:
| Well, all optimization problems are equivalent to a maximum
| likelihood estimate for a corresponding probability
| distribution so you may make more implicit assumptions than
| you think.
|
| Typical ML methods just have a _huge_ distribution space that
| can fit almost anything from which they pick just 1 option.
| This has two downsides:
|
| Since your distribution space is several times too large by
| design, you lose the ability to say anything useful about the
| accuracy of your estimate, other than that it is not the only
| option _by far_.
|
| Since you must pick 1 option from your parameter space you
| may miss slightly less likely explanations that may still
| have huge consequences, which means your models tend to end
| up overconfident.
| nonameiguess wrote:
| (Some) ML is non-parametric, but there are always some
| questions you need to be able to answer about your data. At
| bare minimum, is the generating process ergodic, what is the
| error of your measurement procedure, how representative of
| the true underlying distribution is your sampling procedure?
| All use of data should start with some exploratory analysis
| before you ever get to the modeling stage.
|
| Once you have a model, at minimum understand how to tune for
| the tradeoffs of different types of error and don't naively
| optimize for pure accuracy. At the obvious extremes, if
| you're trying to prevent nuclear attack, false negatives are
| much more costly than false positives, if you're trying to
| figure out whether to execute someone for murder, false
| positives are much more costly than false negatives.
| Understand the relative costs of different types of error for
| whatever you're trying to predict and proceed accordingly.
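| A sketch of what "proceed accordingly" can look like in practice:
| move the decision threshold instead of optimizing raw accuracy
| (assumes some fitted binary classifier with predict_proba, e.g.
| from scikit-learn; the costs are made-up placeholders):
|
|     def predict_with_costs(model, X, cost_fn=10.0, cost_fp=1.0):
|         # Flag as positive whenever the expected cost of a miss
|         # exceeds the expected cost of a false alarm:
|         # p * cost_fn >= (1 - p) * cost_fp
|         threshold = cost_fp / (cost_fp + cost_fn)  # 1/11 here
|         p = model.predict_proba(X)[:, 1]
|         return p >= threshold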
| rademacher wrote:
| The problem is that in high dimensions, knowing the distribution
| or even characterizing it fully with data is incredibly difficult
| (curse of dimensionality). I think the real assumption in ML is
| just that there is some low-dimensional space that
| characterizes the data well, and ML algorithms find those
| directions where the data is constant.
| iagovar wrote:
| ML looks (for many people) like a way to circumvent your grumpy
| statistician saying that the underlying data is worthless and/or
| that you should focus on getting the data pipeline done properly
| for a logit model on your churn rate.
| analog31 wrote:
| "Scientist free science," -- being able to optimize systems
| without understanding them, has been a dream of the business
| world since the dawn of time. There's always been a market
| for cookbook recipes that automate the collection of data,
| and interpretation of results. Before ML, there were "design
| of experiments," and "statistical quality control."
| carlmr wrote:
| >Before ML, there were "design of experiments," and
| "statistical quality control."
|
| Statistical quality control, at least the way I know it, is
| very useful in finding problems in your process. I'm also
| not sure how this fits with your premise. It's about
| optimizing systems by first finding out where to look, and
| then looking there in detail with expert knowledge, i.e.
| deep understanding of your system.
| analog31 wrote:
| I'm definitely with you there, but I've also seen the
| side of it where it turns into a cargo cult and runs
| headlong into the replication crisis.
|
| Perhaps the good thing is that as the new things gain
| popular attention, the old techniques such as SPC are
| under less pressure to support success theater, and
| revert to being actually useful, solid tools.
| cambalache wrote:
| > You know the distribution of the phenomenon under study
|
| If you know the distribution of the phenomenon under study, you
| don't need ML; that is what probability is for.
|
| > or make an explicit assumption and assume the risk of being
| wrong
|
| No. You have the bias/variance tradeoff here. You can make an
| explicit assumption about your model or not.
|
| > Using (1), you calculate how much data you need so you get an
| estimation error below x%
|
| This is extremely complicated for anything except the most
| trivial toy examples, probably not solvable at all and
| definitely not the way biological intelligent systems (aka some
| humans) do it.
| CabSauce wrote:
| Wait until you find out how many studies have been published in
| medical journals with serious statistical flaws.
| sfink wrote:
| Personally, I think the main problem with ML is simpler: it
| works well for interpolation, and is crap for extrapolation.
|
| If the outputs you want are well within the bounds of your
| training data set, ML can do wonders. If they aren't, it'll
| tell you that in 20 years everyone will be having -0.2 children
| and all the other species on the planet will start having to
| birth human babies just so they can be thrown into the smoking
| pit of bad statistical analysis.
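| A toy illustration of the interpolation/extrapolation gap
| (made-up numbers, not real demographic data):
|
|     import numpy as np
|
|     years = np.array([0, 5, 10, 15, 20])
|     births = np.array([2.1, 1.9, 1.7, 1.5, 1.3])  # children per family
|     slope, intercept = np.polyfit(years, births, 1)
|
|     print(slope * 12 + intercept)  # inside the data: ~1.62, plausible
|     print(slope * 60 + intercept)  # far outside the data: ~-0.3 children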
| carlosf wrote:
| I agree, but that's equivalent to my original claim.
|
| Being bad at extrapolation is a consequence of assuming all
| training data can describe your phenomenon's distribution and
| being wrong.
| kvathupo wrote:
| As currymj commented, this isn't accurate for ML, only for
| classical statistics.
|
| In ML (or more specifically deep learning), we make no
| distribution-based assumptions, other than the fundamental
| assumption that our training data is "distributed like" our
| test data. Thus, there aren't issues with fat-tailed
| distributions since we make no such normality assumptions.
| Indeed, with the use of autoencoders, we don't assume a _single
| distribution_, but rather a _stochastic process_.
|
| I suppose you could say statistics is less "empirical" than ML
| in the sense that it is axiom-based, whether that is a
| normality assumption of predictions about a regression line or
| stock prices following a Wiener process. By contrast, ML is
| less rationalist by simply reflecting data.
| peytn wrote:
| I dunno, there are definitely distribution-based assumptions
| -- good luck working with skewed data. Most old-school
| techniques are kinda additive, so nobody's really been
| assuming a _single distribution_ for practical applications.
|
| Current ML techniques just work well for the kinds of
| problems people are applying them to, which is kind of a
| tautology. We should definitely seek to understand the theory
| behind stuff like dropout and not consider our lack of
| understanding a strength.
| clircle wrote:
| The only reason that this may not be accurate for ML is
| because machine learners generally make no attempt to
| quantify their uncertainty in their predictions with e.g.
| confidence intervals or prediction intervals.
|
| And there is a whole field of non-parametric statistics that
| doesn't make distribution assumptions.
| _dps wrote:
| It is absolutely untrue that DL is immune to fat-tail
| problems, and it is important that no one operate mission
| critical systems under this assumption.
|
| The two fat tail questions one has to engage are:
|
| - is it possible that a catastrophic input might be lurking
| in the wild that would not be present in a typical training
| set? Even with a 1M instance training set, a one-in-a-million
| situation will only appear (and affect your objective
| function) on average one time, and could very well not appear
| at all.
|
| - can I bound how badly I will suffer if my system is allowed
| to operate in the wild on such an input?
|
| DL gives no additional tools to engage these questions.
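| For the first question, the back-of-the-envelope number (assuming
| independent draws):
|
|     p_absent = (1 - 1e-6) ** 1_000_000
|     print(p_absent)  # ~0.368: a one-in-a-million input has roughly a
|                      # 37% chance of never appearing in 1M examples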
| kvathupo wrote:
| I don't quite follow: is not what you described a flaw
| fundamental to all forecasting; that is, the occurrence of
| a gross outlier? I should clarify that DL doesn't suffer
| from the same problem the normality condition has on fat-
| tails: a failure to capture the skew of the distribution.
| _dps wrote:
| It's not characteristic of all forecasting, only purely
| empirical forecasting.
|
| Definitionally, the only way to reason about risk that
| doesn't appear in training data is non-empirical (e.g. a
| priori assumptions about distributions, or worst cases,
| or out-of-paradigm tools like refusing to provide
| predictions for highly non-central inputs).
|
| DL is not any better (or worse) than any other purely
| empirical method at answering questions about fat-tail
| risk, and the only way to do better is to use non-
| empirical/a-priori tools. Obviously the tradeoff here is
| that your a priori assumptions can be wrong, and that too
| needs to be included in your risk model (see e.g. Robust
| Optimization / Robust Control).
| sjburt wrote:
| I think it's wrong to assume that non-empirical methods
| can be reliably trusted to give better results. Humans
| are terrible at avoiding bias or evaluating risks,
| especially for uncommon events.
| godelski wrote:
| > It is absolutely untrue that DL is immune to fat-fail
| problems
|
| In fact, working on fat tail problems is currently a hot
| topic in ML.
| fractionalhare wrote:
| _> In ML (or more specifically deep learning), we make no
| distribution-based assumptions, other than the fundamental
| assumption that our training data is "distributed like" our
| test data._
|
| Okay, so that's about the same as classical statistics.
| You're just waiving the requirement to know _what_ the
| distribution is. You are still assuming there exists a
| distribution and that it _holds_ in the future when you apply
| the model. Sure you may not be trying to estimate parameters
| of a distribution, but it is still there and all standard
| statistical caveats still apply.
|
| _> Indeed, with the use of autoencoders, we don't assume a
| single distribution, but rather a stochastic process._
|
| Classical statistics frequently makes use of multiple
| distributions and stochastic processes.
| potatoman22 wrote:
| Of course there's a distribution behind the data. The
| parent commenter was saying not all machine learning
| techniques need to know that distribution, as a rebuttal to
| their parent comment.
| fractionalhare wrote:
| I know what they're saying, I even reiterate it in my
| second sentence. My point is that doesn't protect you
| from the distribution changing, which is a problem that
| applies to machine learning and classical statistics.
|
| This is in support of the GP comment: while you can
| loosen your assumptions about what the underlying
| distribution is and don't literally need to know it, you
| can't get away from the fundamental limitations of
| statistics. Which is the original topic we're talking
| about.
| mochomocha wrote:
| I agree that ML tends to put weaker assumptions on the data
| than classical statistics and that it's a good thing.
|
| However most ML certainly makes distributional assumptions -
| they are just weaker. When you're learning a huge deep net
| with an L2 loss on a regression task, you have a parametric
| conditional Gaussian distribution under the hood. Being
| overparametrized does not mean there is no distributional
| assumption. Vanilla autoencoders are also working under a
| multivariate gaussian setup as well. Most classifiers are
| trained under a multinomial distribution assumption etc.
|
| And fat-tailed distributions are definitely a thing. It's
| just less of a concern for the mainstream CV problems on
| which people apply DL.
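| Spelling out the L2 case (a standard identity, not specific to
| deep nets): if $y = f_\theta(x) + \varepsilon$ with
| $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, then
|
|     $-\log p(y \mid x, \theta) = \frac{(y - f_\theta(x))^2}{2\sigma^2} + \mathrm{const}$
|
| so minimizing squared error over $\theta$ is exactly maximum
| likelihood under that conditional Gaussian assumption.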
| dumb1224 wrote:
| > I suppose you could say statistics is less "empirical" than
| ML in the sense that it is axiom-based, whether that is a
| normality assumption of predictions about a regression line
| or stock prices following a Wiener process. By contrast, ML
| is less rationalist by simply reflecting data.
|
| I don't think that's true (or maybe I misunderstood?), I
| guess your comment "simply reflecting data" means fitting
| data with a very flexible function (curve)? There are very
| flexible distributions to fit almost any kind of data, e.g.
| https://en.wikipedia.org/wiki/Gamma_distribution or with a
| composition of them, but as a practitioner you still need to
| interpret the model and check if it does represent the
| underlying process well. Both statistical inference and ML
| are getting there using different methods.
| boilerupnc wrote:
| [Disclosure: I'm an IBMer - not involved with this work]
|
| With regard to exploitation, IBM research has done some
| interesting work in the form of an open source "Adversarial
| Robustness Toolbox" [0]. "The open source Adversarial
| Robustness Toolbox provides tools that enable developers and
| researchers to evaluate and defend machine learning models and
| applications against the adversarial threats of evasion,
| poisoning, extraction, and inference."
|
| It's fascinating to think through how to design the 2nd and 3rd
| order side-effects using targeted data poisoning to achieve a
| specific outcome. Interestingly, poisoning could be to force a
| specific outcome for a one-time gain (e.g. feed data in a way
| to ultimately trigger an action that elicits some gain/harm) or
| to alter the outcomes over a longer time horizon (e.g. Teach
| the bot to behave in a socially unacceptable way)
|
| [0] https://art360.mybluemix.net/
| bigbillheck wrote:
| > 1 - You know the distribution of the phenomenon under study
| (or make an explicit assumption and assume the risk of being
| wrong)
|
| Nonparametric methods say 'hi'.
| astrophysician wrote:
| I agree -- as ML becomes increasingly easy for non-experts or
| people without a heavy math/stats background to apply, I've
| seen an increasing volume of arguments against the data
| science profession (someone the other day called DS the "gate-
| keepers"), but: there be dragons.
|
| Anyone can use SOTA deep learning models today, but in my
| experience, it's more important to understand the answer to
| "what are the shortcomings/consequences of using a particular
| method to solve this problem?" "what is (or could be) biases in
| this dataset?", etc. It requires a non-trivial understanding of
| the underlying methodology and statistics to reliably answer
| these questions (or at least worry about them).
|
| Can you apply deep reinforcement learning to your problem?
| Maybe. Should you? Well, it depends, and you should understand
| the pros and cons, which requires more than just the knowledge
| of how to make API calls. There are consequences to misusing
| ML/AI, and they may not even be obvious from offline testing
| and cross validation.
___________________________________________________________________
(page generated 2021-03-31 23:01 UTC)