[HN Gopher] What Data Can't Do
       ___________________________________________________________________
        
       What Data Can't Do
        
       Author : RiderOfGiraffes
       Score  : 144 points
       Date   : 2021-03-31 12:45 UTC (10 hours ago)
        
 (HTM) web link (www.newyorker.com)
 (TXT) w3m dump (www.newyorker.com)
        
       | temp8964 wrote:
        | This article is total gibberish. It's a terrible mixture of
        | many unrelated things. Just because those things all have
        | something to do with data (anything can be presented in numeric
        | form), it does not mean their issues are about data.
       | 
       | First, the Tony Blair example is not about data. It is a failure
        | of government planning. It's wrong politics and wrong economics.
       | 
       | The G.D.P. example is laughable. G.D.P. is never intended to be
       | used to compare individual cases. What kind of nonsense is this?
       | 
       | And the IQ example. The results are backed by decades of
        | extensive studies. The author thinks picking a few critics can
        | invalidate the whole field. And look: the white supremacist who
        | gave Asians the highest IQ - what a disgrace to his own ideology.
       | 
        | There are many more. I feel there's a kind of tactic to
        | producing this kind of article: just glue together a bunch of
        | things that seem related, and bam, you've got an article.
        
         | cloogshicer wrote:
          | Gotta disagree here. This article is acknowledging a pattern:
          | that data is misused in many different areas.
         | 
          | I think the problem goes even deeper, to a misunderstanding of
          | the scientific method. There's a good discussion of this topic
          | here:
         | https://news.ycombinator.com/item?id=26122712
        
           | temp8964 wrote:
           | "data is misused in many different areas" is not a valuable /
           | informative point.
           | 
            | There are many wrongs that seem to have something to do with
            | data, but in fact they do not.
           | 
            | For example, a socialist planned economy will eventually
            | fail, and then you would say they misused data. It seems
            | relevant, but misusing the data is not the real cause of
            | their failure at all.
        
             | not2b wrote:
             | Replace "socialist" with "large company". The companies
             | gather data, establish metrics, and manage to those
             | numbers, and often bad things result. Ever been in a
             | company where some internal support function goes to hell
             | because its top manager's bonus depends on a metric, and
             | they can improve that metric by refusing to support the
              | users (finding excuses to close IT support calls without
              | fixing the issue, etc.)?
        
               | temp8964 wrote:
                | Yes. But then some would say the company failed because
                | it "misused data", when that was not the real cause of
                | the company's failure. Any project that involves using
                | data could blame its failure on "misused data", which is
                | a useless conclusion.
        
       | tppiotrowski wrote:
       | "once a useful number becomes a measure of success, it ceases to
       | be a useful number"
       | 
       | Two other unintended consequences of incentives I learned in
       | economics:
       | 
       | 1. Increasing fuel efficiency does not reduce gas consumption.
       | People just use their car more often.
       | 
       | 2. Asking people to pay-per-bag for garbage pickup resulted in
       | people dumping trash on the outskirts of town.
       | 
       |  _Edit: Did more research after downvote. Definitely double check
       | things you learn in college_
       | 
       | 1. The jury is still out:
       | https://en.wikipedia.org/wiki/Jevons_paradox
       | 
       | 2. Seems false
       | https://en.wikipedia.org/wiki/Pay_as_you_throw#Diversion_eff...
        
       | kerblang wrote:
        | They might as well have included the granddaddy example (as far
        | as the age of computing goes): the Vietnam War, McNamara, and
        | body counts.
        
       | neonate wrote:
       | https://archive.is/ynOm2
        
       | sradman wrote:
       | From the ungated archive [1]:
       | 
       | > Whenever you try to force the real world to do something that
       | can be counted, unintended consequences abound. That's the
       | subject of two new books about data and statistics: "Counting:
       | How We Use Numbers to Decide What Matters", by Deborah Stone,
       | which warns of the risks of relying too heavily on numbers, and
       | "The Data Detective", by Tim Harford, which shows ways of
       | avoiding the pitfalls of a world driven by data.
       | 
       | Data is a powerful feedback mechanism that can enable system
       | gamification; it can also expose it. The evil is extracting
        | unearned value from a system through gamification, not the tools
       | employed to do so. I'm looking forward to reading both books.
       | 
       | [1] https://archive.is/ynOm2
        
       | kazinator wrote:
       | > _doctors would be given a financial incentive to see patients
       | within forty-eight hours._
       | 
       | Not measuring that from the first contact that the patient made
       | is simply dishonest.
       | 
       | "Call back in three days to make the appointment, so I can claim
       | you were seen within 48 hours, and therefore collect a bonus"
       | amounts to fraud because the transaction for obtaining that
       | appointment has already been initiated.
       | 
        | I mean, they could just as well give the person the appointment
       | in a secret, private appointment registry, and then copy the
       | appointments from that registry into the public one in such a way
       | that it appears most of the appointments are being made within
       | the 48 hour window. Nothing changes, other than that bonuses are
       | being fraudulently collected, but at least the doctor's office
       | isn't being a dick to the patients.
        
         | Eridrus wrote:
          | It's really hard to design a metric that can't be gamed. The
          | problem here seems to be that doctors are under-provisioned
          | for some reason, and long wait times are a form of load
          | shedding for the system. Individual clinics have little
          | control over this core issue, because they are generally boxed
          | in by regulations over who can administer medical care, except
          | to rush appointments (which they're probably already doing).
          | Without addressing it, there's not much they can do to solve
          | the problem, so all they can do is game the rules or forgo the
          | bonuses.
        
           | closeparen wrote:
           | Doctors don't get to bill for idle time. Being less than
           | fully utilized is leaving money on the table. The idea here
           | is presumably to compensate them for leaving gaps in their
           | schedules.
        
           | [deleted]
        
         | DharmaPolice wrote:
          | It may amount to fraud, but in the context of that service no
          | record was kept of calls that didn't result in an appointment.
          | My wife was a receptionist at a GP's around the time the
          | article mentions, and in some cases it was worse than that: if
          | you phoned and asked for an appointment and they couldn't give
          | you one within 48 hours, they wouldn't offer one at all,
          | telling you to call back later / the next day.
         | 
         | Although the New Yorker piece has leaned on the bonus angle the
         | way it was discussed publicly was that doctors weren't allowed
         | to offer you appointments outside the 48 hour window [0].
         | 
          | It was a very silly interpretation of the rules, but I think
          | GPs felt the target was too rigid and therefore stuck to the
          | letter rather than the spirit.
         | 
         | 0 - http://news.bbc.co.uk/1/hi/health/3682920.stm
        
         | ectopod wrote:
         | Not to defend the system, but I don't think they were trying to
          | fraudulently collect the 48-hour payment. Rather, if they accepted
         | advance bookings (which was, after all, the old system) then
         | almost nobody would be seen in 48 hours. They could have
         | offered advance bookings for follow-ups, but since most
         | appointments are taken by the sickest people, many of them will
         | be follow-ups, so this probably wouldn't have helped.
         | 
         | If you want to reduce the queuing time in a system you need to
         | reduce the processing time (i.e. the duration of an
         | appointment) or increase the number of servers (i.e. doctors).
         | You can't do it by edict.
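          | 
          | A minimal queueing sketch of that point in Python (the numbers
          | are illustrative, not from the article): the average wait only
          | moves when the service rate or the server count does.
          | 
          |     import random
          | 
          |     def avg_wait(n_servers, arrival_rate=1.8,
          |                  mean_service=1.0, n=20000, seed=0):
          |         rng = random.Random(seed)
          |         t = 0.0
          |         free_at = [0.0] * n_servers  # when each doctor frees up
          |         total_wait = 0.0
          |         for _ in range(n):
          |             t += rng.expovariate(arrival_rate)  # next arrival
          |             i = min(range(n_servers),
          |                     key=free_at.__getitem__)
          |             start = max(t, free_at[i])  # wait for a free doctor
          |             total_wait += start - t
          |             free_at[i] = start + rng.expovariate(
          |                 1.0 / mean_service)
          |         return total_wait / n
          | 
          |     print(avg_wait(2))  # two doctors, ~90% utilized: long waits
          |     print(avg_wait(3))  # one more doctor: waits collapse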
        
         | bluGill wrote:
          | In hindsight, sure. But did you think of that before hearing
          | of the problem? Even if you did, can you think through,
          | without much time to think, how every possible metric on every
          | possible topic could be gamed?
         | 
          | Tony Blair was trying to solve a real problem that needed
          | solving. That he opened up a different problem is something we
          | should think of as normal, and not blame him for trying to
          | solve the original problem. The question should be how we
          | change the metric until the unintended consequences are ones
          | we can live with. That will probably take more than a lifetime
          | to work out.
         | 
         | Note that there will be a lot of debate. There are predicted
         | consequences that don't happen in the real world for whatever
         | reason. There are consequences that some feel we can live with
         | that others will not accept. Politics is messy.
        
       | wodenokoto wrote:
       | The article seems to stop pretty early, as if something is
       | missing.
       | 
       | It's an anecdote about a government incentive to have doctors see
       | patients within 48 hours causing doctors to refuse scheduling
       | patients later than 48 hours in order to get the incentive bonus.
       | 
        | This is not an example of the limits of data, but an example of
        | perverse incentives.
        
         | justinclift wrote:
         | > It's an anecdote about a government incentive to have doctors
         | see patients within 48 hours causing doctors to refuse
         | scheduling patients later than 48 hours in order to get the
         | incentive bonus.
         | 
         | That part is probably just the first 1/8th or so of the article
         | (rough guess). Sounds like it was cut short for you?
        
         | polytely wrote:
         | Did you use reader mode? because I noticed that when I use
         | firefox's reader mode it will cut off part of the article on
         | the New Yorker site.
        
           | strathmeyer wrote:
           | We clicked on the link and were presented with two
           | paragraphs. New Yorker articles usually don't show up
            | correctly; I don't know why they're allowed to be posted here.
           | Acting like we don't know how to read a webpage is
           | gaslighting.
        
             | polytely wrote:
              | Well, that wasn't my intention at all. It was just a
              | problem I encountered a few hours before I read your
              | comment, so it was fresh in my mind.
        
           | Fishkins wrote:
           | When I first opened the article in Firefox, I only saw the
           | first two paragraphs (same as the OP). This was true whether
           | it was in reader mode or not. I opened it in Chrome and saw
           | the whole article.
           | 
           | I just tried opening it in Firefox now (a couple hours later)
           | and I see the whole article. If I switch to reader mode I do
           | see it's truncated about halfway through, but I think that's
           | a separate issue from what the OP was seeing.
        
           | wodenokoto wrote:
           | On iOS. Initially used reader mode, but switched because it
           | seemed cut off.
           | 
           | But also without reader mode I can't see more than the NHS
           | anecdote.
        
         | jabroni_salad wrote:
         | When a measure becomes a target, it ceases to be a good
         | measure.
         | 
          | This case could be said to be creating misleading data. If the
          | doctors' offices aren't recording appointments more than 48
          | hours in advance, the system loses visibility into the total
          | number of people who want appointments. Every office will
          | appear to be 100% efficient even though there is effectively
          | still an invisible waiting list.
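          | 
          | A toy sketch of that invisible waiting list (made-up numbers):
          | 
          |     # Demand exceeds capacity, but nothing beyond 48 hours is
          |     # ever booked, so no long wait is ever recorded.
          |     daily_demand, daily_capacity = 30, 10
          | 
          |     unseen = 0
          |     for day in range(30):
          |         want = unseen + daily_demand  # turned-away callers retry
          |         seen = min(want, daily_capacity)
          |         unseen = want - seen
          | 
          |     print(unseen)  # 600 queuing invisibly; recorded waits <= 48h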
        
         | [deleted]
        
       | jplr8922 wrote:
        | Confusing performance metrics with strategic objectives is not a
        | data problem, it is a human problem. It happens to a lot of
        | people outside the usual Blair-WhiteNationalist-IQ crowd. I do
        | not think that advanced technical knowledge in ML or stats is
        | required to avoid this mistake; what is required is the ability
        | to make valid counterfactual statements.
       | 
        | A good example of what I mean can be found on Wikipedia:
       | 
       | His instinctive preference for offensive movement was typified by
       | an answer Patton gave to war correspondents in a 1944 press
       | conference. In response to a question on whether the Third Army's
       | rapid offensive across France should be slowed to reduce the
       | number of U.S. casualties, Patton replied, "Whenever you slow
       | anything down, you waste human lives."[103]
       | 
       | https://en.wikipedia.org/wiki/George_S._Patton
       | 
        | Here, US general Patton is not confusing a performance metric
        | (number of casualties) with the strategic goal (winning the
        | war). His counterfactual statement could be that "if we slow
        | things down, we are simply delaying future battles and
        | increasing the _total_ number of casualties required to achieve
        | victory".
       | 
        | I'm not surprised at Blair's decision. When we choose leaders,
        | do we favor long-term strategic thinkers, or opportunistic
        | pretty faces?
        
       | lmm wrote:
       | Does the use of statistics actually amplify misunderstanding, or
       | merely reveal misunderstandings that were already there? In any
        | of the examples given - predicting rearrests, infant mortality,
       | or so on - it's hard to imagine that someone not using numbers
       | would have reached a conclusion that was any closer to the truth.
       | 
       | Data has its limits, but the solution is usually - maybe even
       | always - more data, not less.
        
         | mjburgess wrote:
         | It's pretty trivial to predict things without "data". Data just
         | means using some measurement system to obtain measurements of
          | some target phenomenon. Many targets cannot be measured, or
          | have not yet occurred and so cannot be measured.
         | 
         | Reasoning counter-factually is trivial: What would happen if I
         | dropped this object in this place in which an object, of this
         | kind, has never been dropped before?
         | 
         | Well apply relevant models, etc. and "the object falls, rolls,
         | pivots, etc.".
         | 
         | This is reasoning-forward _from_ models, rather than backwards
          | from data. And it's the heart of anything that makes any
         | sense.
         | 
         | Data is not a model and provides no model. The "statistics of
         | mere measurement" is a dangerously utopian misunderstanding of
         | what data is. The world does not tell you, via measurement,
         | what it is like.
        
       | kkoncevicius wrote:
       | This article is not so much about the data, as it is about rules
       | and thresholds used to divide that data into groups.
        
       | mdip wrote:
       | Kind of love the initial story in the article about 48-hour wait
       | times.
       | 
       | I had a stint writing conferencing software for quite some time,
        | and every once in a while we'd come across a customer
        | requirement with capabilities that were, to us developers,
        | obviously going to be misused. As a result, we did the
        | "Thinking, Fast and Slow" pre-mortem to help surface other ways
        | that the system could be attacked (along with what we would do
        | to prevent it and how it impacted the original feature).
       | 
       | If you create something, and open it to the public, and there's
       | _any way_ for someone to misuse it for financial incentive
       | (especially if they can do so without consequence), it _will_ be
       | misused. In fact, depending on the incentive, you may find that
       | the misuse becomes the only way that the service is used.
        
         | kazinator wrote:
         | When the doctor's office is inundated with patient visits, you
          | cannot fix scheduling backlogs by fiddling with the scheduling
         | algorithm, no matter how much data you have.
         | 
         | Say the calendar is initially empty and 1000 people want to see
         | the doc, right now. You can fill them all into the calendar, or
         | you can play games that solve nothing, like only filling
         | tomorrow's schedule with 10 people, asking 990 of them to call
         | back. That doesn't change the fact that it takes 100 days to
         | see 1000 patients. All it does is cause unfair delays; the
         | original 1000 can be pre-empted by newcomers who get earlier
         | appointments since their place in line is not being maintained.
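          | 
          | A quick sketch of the arithmetic (illustrative: one newcomer
          | per day snags a slot ahead of the original queue):
          | 
          |     from collections import deque
          | 
          |     original = deque(range(1000))  # the initial backlog
          |     days = 0
          |     while original:
          |         days += 1
          |         # 10 slots per day; a lucky newcomer grabs one of them
          |         for _ in range(10 - 1):
          |             if original:
          |                 original.popleft()
          | 
          |     print(days)  # ~112 days instead of 100, unfairly ordered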
        
         | strathmeyer wrote:
         | How do we read more than the initial story? Do we have to pay
         | to read it? There is no indication on the webpage there is more
         | than two paragraphs other than the advertisement for the
         | author's book.
        
         | lifeisstillgood wrote:
         | conferencing as in 'ComicCon' or 'Zoom'? Can you give an
         | example?
        
           | mdip wrote:
           | Haha, hadn't even thought of that -- Conferencing as in
           | developing bespoke (and some white-label) software for
           | organizations deploying Office Communications Server (and
           | R2), Lync and ultimately Skype for Business (I do a little
           | Teams work these days but I am focused on other areas,
           | presently).
        
       | foolinaround wrote:
        | Around the paywall: https://archive.vn/ynOm2
        
       | tomrod wrote:
       | An absolutely fantastic article that captures my concerns as a
       | user, purveyor, and automater of systems that help with numbers.
       | I'm always very cautious regarding the jump from numbers
       | informing to numbers deciding.
        
       | visarga wrote:
       | Data is very limited, indeed. We can't predict outside the
       | distribution, or unrelated events (without a causal link), or
       | random events in the future. We should be humble about the limits
       | of data.
        
         | _rpd wrote:
         | Sure, but coding a human-equivalent response to such events is
         | trivial (because no one and nothing responds well to such
         | events).
        
       | ppod wrote:
       | This author has published a couple of articles like this at the
        | New Yorker. They all have this in common: the author works through
       | some interesting and in some ways unusual cases where data or
       | statistics have been improperly or naively applied, with some
       | social costs. I really enjoy the articles themselves.
       | 
       | Then the New Yorker packages it up with a cartoon and a headline
       | and subheadline like "Big Data: When will it eat our children?"
       | or "Numbers: Do they even have souls?", and serves it up to their
       | technophobic audience in a palatable way.
       | 
       | https://www.newyorker.com/contributors/hannah-fry
        
         | dsaavy wrote:
         | > *Numbers don't lie, except when they do. Harford is right to
         | say that statistics can be used to illuminate the world with
         | clarity and precision. They can help remedy our human
         | fallibilities. What's easy to forget is that statistics can
         | amplify these fallibilities, too. As Stone reminds us, "To
         | count well, we need humility to know what can't or shouldn't be
         | counted."*
         | 
         | I do have a problem with her conclusion here. Are numbers
         | really lying if it's actually an incorrect data collection
         | method or conflicting definitions of criteria for generation of
         | certain numbers (like the example used in the second to last
         | paragraph)? She seems to be pointing out a more important fact,
         | which is that people don't question underlying data, how it was
         | collected, and the choices those data collectors made when
         | making a data set. People tend to take data and conclusions
         | drawn from it as objective realities, when in reality data is
         | way more subjective.
        
           | lupire wrote:
           | > Are numbers really lying if it's actually an incorrect data
           | collection method or conflicting definitions of criteria for
           | generation of certain numbers
           | 
            | Obviously it's a metaphor, but it's pretty clearly
           | a case of "this supposedly objective factual calculation is
           | presenting an untruth."
        
           | not2b wrote:
           | One of the points is that the act of collecting the numbers
           | and making decisions based on them can change the underlying
           | behavior. The numbers can be perfectly correct (how many
           | cases does the IT department get? How long does it take on
           | average to resolve the issue?). The goal can be correct (we
           | want to get issues resolved faster). But as soon as you try
            | to manage people based on those perfectly valid numbers, bad
            | things often happen.
        
         | quietbritishjim wrote:
          | Hannah Fry is a mathematics communicator working in quite a
          | few other media. She's been in a couple of BBC documentaries,
          | and in a few videos on the Numberphile YouTube channel (which
          | is also very good regardless of who's on it).
        
       | canadianwriter wrote:
       | Data always needs to be paired with empathy. ML/AI simply doesn't
       | have empathy so it will always be missing a piece of the overall
       | pie.
       | 
       | Let AI crunch the numbers, but combine it with a human who can
       | understand the "why" of things and you can really kick butt.
        
         | gnulinux wrote:
         | I agree with you, although, unfortunately, most -- if not all
         | -- engineers I know would respond to this by complaining about
         | how "a human who can understand the 'why'" cannot be automated.
        
       | williesleg wrote:
       | solve covid, not a vaccine but a treatment, those are forbidden.
        
       | Ozzie_osman wrote:
       | Data is not a substitute for good judgment, for empathy, for
       | proper incentives.
       | 
       | The article focuses on governments and bureaucracies but there's
       | no better example than "data-driven" tech companies, as we A/B
       | test our key engagement metrics all the way to soulless products
       | (with, of course, a little machine learning thrown in to juice
       | the metrics).
       | 
       | I wrote about this before:
       | https://somehowmanage.com/2020/08/23/data-is-not-a-substitut...
        
         | gen220 wrote:
          | I've written the same sentence before! This is so cool!
          | Pardon the wall of text.
         | 
         | Here's my thesis, curious to hear your thoughts.
         | 
         | At some time around 2005, when efficient persistence and
          | computation became cheap enough that any old F500 corp could
         | afford to endlessly collect data forever, something happened.
         | 
         | Before 2005, if a company needed to make a big corporate
         | decision, there was _some_ data involved in making the
         | decision, but it was obviously riddled with imperfections and
          | aggregation biases.
         | 
         | Before 2005, executives needed to be seasoned by experience, to
         | develop this thing we call "Good Judgement", that allows them
         | to make productive inferences from a paucity of data. The
         | Corporate Hierarchy was a contest for who could make the best
         | inferences.
         | 
         | Post-2005, data collection is ubiquitous. Individuals and
         | companies realized that you don't need to pay people with
         | experience any more, you can simply collect better data, and
         | outsource decision-making to interpretations of this data. The
          | corporate hierarchy now is all about who can gather the "best"
         | data, where "best" means grow the money pile by X% this
         | quarter.
         | 
         | "Good Judgement" used to be expected from the CEO, down to at
         | least 1-3 levels of middle management above the front-line
         | people. Now, it appears (to me) to be mostly a feature of the
         | C-Suite and Boards, and it's disappeared elsewhere. Long-term,
         | high-performing companies seem to have a more diffused sense of
          | good judgement. But these are rare. Maybe they always have
         | been?
         | 
         | Anyways, as we agree, this has a tendency to lead in
         | problematic directions. Here's my thesis on "why".
         | 
         | Fundamentally, any "data" is reductive of human experience.
         | It's like a photograph that captures a picture by excluding the
         | rest of the world.
         | 
         | Few people seem to understand this analogy, because they think
         | photographs are the ultimate record of an event. Lawyers
         | understand this analogy. With the right framing, angle,
         | lighting (and of course, with photoshop), you can make a
         | photograph tell any story you want.
         | 
         | It's the same issue with data, arguably worse since we don't
         | have a set of standard statistics. We have no GAAP-equivalent
         | for data science (yet?).
         | 
         | Our predecessors understood that data was unreliable, and
         | compensated for this fact by selecting for "Good Judgement".
         | The modern mega-corps demonstrate that we don't have a good
         | understanding of this today, evidenced by religious "data-
         | driven" doctrine, as you describe.
         | 
         | People will say "hey! at least some data is better than no
         | data!", to which I'll say data is useless and even harmful in
         | lieu of capable interpreters. In 2021, have an abundance of
         | data, but a paucity of people who are capable of critical
         | interpretation thereof.
         | 
         | I don't know if it's a worse situation than we had 20 years
         | ago. But it's definitely a different situation, that requires a
         | new approach. I think people are taking notice of it, so I'm
         | hopeful.
        
           | cloogshicer wrote:
           | Thank you for writing this, I enjoyed reading it and largely
           | agree.
        
         | mminer237 wrote:
         | I think it's almost worse in tech because it largely works. If
         | the government sets a flawed metric, their real goal of
         | pleasing their constituents has failed and theoretically they
         | either have to fix it or lose political support.
         | 
         | But in tech, if your goal is just to make money, soullessly
         | following data will often get you there, to the detriment of
         | everyone else. Clickbait headlines will get you more views.
         | Full-page popup ads will get you more ad clicks/newsletter
         | subscriptions. Microtransactions will get you more sales.
         | Gambling mechanics will get you more microtransactions.
         | 
         | You can say it's a flawed metric, but I think in the end, most
         | people just actually care more about making money than they do
         | about building a good product.
        
       | carlosf wrote:
        | I am increasingly worried about people applying ML to
        | everything without any rigour.
        | 
        | Statistical inference generally only works well under very
        | specific conditions:
       | 
       | 1 - You know the distribution of the phenomenon under study (or
       | make an explicit assumption and assume the risk of being wrong)
       | 
       | 2 - Using (1), you calculate how much data you need so you get an
       | estimation error below x%
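        | 
        | A minimal sketch of (2), assuming normality per (1); the z,
        | sigma and E values are just illustrative:
        | 
        |     from math import ceil
        | 
        |     z = 1.96      # 95% confidence, normality assumed as in (1)
        |     sigma = 15.0  # assumed population standard deviation
        |     E = 1.0       # target margin of error for the mean
        | 
        |     n = ceil((z * sigma / E) ** 2)
        |     print(n)      # 865 samples needed under these assumptions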
       | 
       | Even though most ML models are essentially statistics and have
        | all the same limitations (issues with convergence, fat-tailed
        | distributions, etc.), it seems the industry standard is to
       | pretend none of that exists and hope for the best.
       | 
        | IMO the best moneymaking opportunities of the decade will
        | involve exploiting unsecured IoT devices and naive ML models; we
        | will have plenty of both.
        
         | currymj wrote:
          | I think this actually gets at what makes applied ML distinct
          | from statistics as a practice, even though there is a ton of
          | overlap.
          | 
          | Statisticians make assumptions 1 and 2, and think of
          | themselves as trying to find the "correct" parameters of their
          | model.
          | 
          | People doing applied ML typically assume they don't know 1
          | (although they might implicitly make some weak assumptions,
          | like sub-Gaussian tails to avoid fat tails, etc.) and also
          | typically don't care about being able to do 2. And they don't
          | care about their parameters; in a sense, to an ML
          | practitioner, every parameter is a nuisance parameter.
          | 
          | Instead, you assume you have some reliable way of evaluating
          | performance on the task you care about -- usually measuring
          | performance on an unseen test set. As long as this is actually
          | reliable, things are fine.
          | 
          | But you are right that in the face of a shifting distribution
          | or an adversary crafting bad inputs, ML models can break down
          | -- though there is actually a lot of research on ways to deal
          | with this, which will hopefully reach industry sooner rather
          | than later.
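          | 
          | A minimal sketch of that workflow, assuming scikit-learn is
          | available (the dataset and model are just placeholders): no
          | distributional assumptions, only empirical evaluation on an
          | unseen test set.
          | 
          |     from sklearn.datasets import load_breast_cancer
          |     from sklearn.ensemble import RandomForestClassifier
          |     from sklearn.metrics import accuracy_score
          |     from sklearn.model_selection import train_test_split
          | 
          |     X, y = load_breast_cancer(return_X_y=True)
          |     X_tr, X_te, y_tr, y_te = train_test_split(
          |         X, y, test_size=0.2, random_state=0)
          | 
          |     model = RandomForestClassifier(random_state=0)
          |     model.fit(X_tr, y_tr)
          |     # the only check: performance on data the model never saw
          |     print(accuracy_score(y_te, model.predict(X_te)))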
        
           | QuesnayJr wrote:
           | "Every parameter is a nuisance parameter" is a great way to
           | put it.
        
           | RobinL wrote:
           | Yes - this is pretty much exactly how I explain the
           | difference between machine learning and statistics.
           | 
           | Despite using similar models, the expertise required for
           | 'doing statistics' (statistical inference) is actually very
           | different from machine learning. Machine learning fits into
            | the 'hacker mentality' well - try stuff out, see what works.
           | To do statistical inference effectively, you really do need
           | to spend time learning the theory. They both require deep
           | skills - but the skills are surprisingly different
           | considering it's often the same underlying model.
        
             | nickforr wrote:
             | But without some statistical knowledge, isn't there a risk
             | of a lack of understanding about the robustness of "what
             | works"?
        
               | splithalf wrote:
               | Statistical knowledge doesn't remove that risk. The
               | extent to which it even lowers the risk is a question
               | that could be answered empirically.
        
               | RobinL wrote:
               | yeah, agreed - a good understanding of the model's
               | statistical assumptions can often help you make the model
               | more robust and also give you ideas for what types of
               | feature engineering are likely to work.
        
           | carlosf wrote:
            | I disagree; every ML model has some implicit statistical
           | assumption, which is often not well understood by
           | practitioners.
           | 
           | At minimum you must assume your underlying process is not fat
           | tailed. If it is, then your training/validation/test data
           | might never be enough to make reliable predictions and your
           | model might break constantly in prod.
           | 
           | BTW shifting distributions and fat tailed distributions are
           | sort of equivalent, at least mathematically.
        
             | currymj wrote:
             | I don't disagree with any of that, but I still think a
             | responsible, clear-thinking ML practitioner can avoid
             | having to assume the form of the data-generating process,
             | depending on their application.
             | 
             | In some cases if you care about PAC generalization bounds,
             | it's even the case that the bounds do actually hold for all
             | possible distributions.
        
               | dumb1224 wrote:
               | I think it's more meaningful to have the discussion in a
               | specific problem domain since statistical inference or ML
               | are just tools to better model a problem / phenomenon.
                | The domain (prior) knowledge -- everything else that's
                | not stats / ML -- is the key to building a more robust
                | model. Leaving the problem domain out, we are left with
                | just pure mathematical theories, and the points can only
                | be proved with simulated data.
        
           | 6gvONxR4sf7o wrote:
           | > instead you assume you have some reliable way of evaluating
           | performance on the task you care about -- usually measuring
           | performance on an unseen test set. as long as this is
           | actually reliable, then things are fine.
           | 
           | This is the part that often fails in practice. Think of all
           | the benchmarks that show superhuman performance and compare
           | that to how good those same models really aren't.
           | Constructing a good set of holdouts to evaluate on is really
           | hard and gets back to similar issues. In practice, doing what
           | you're describing reliably (in a way that actually implies
           | you should have confidence in your model once you roll it
            | out) is rarely as simple as holding some random bit of your
            | dataset out and checking performance on it.
           | 
           | On the other hand, what you often see is people just holding
           | out a random bunch of rows.
        
         | erichahn wrote:
         | Isn't the point of ML exactly that you don't know the
         | underlying distribution? How is this ever assumed in any way?
         | ML is not parametric statistics.
        
           | contravariant wrote:
            | Well, all optimization problems are equivalent to a maximum
            | likelihood estimate for a corresponding probability
            | distribution, so you may be making more implicit assumptions
            | than you think.
           | 
            | Typical ML methods just have a _huge_ distribution space
            | that can fit almost anything, from which they pick just one
            | option. This has two downsides:
           | 
           | Since your distribution space is several times too large by
            | design, you lose the ability to say anything useful about the
           | accuracy of your estimate, other than that it is not the only
           | option _by far_.
           | 
            | Since you must pick one option from your parameter space,
            | you may miss slightly less likely explanations that may still
           | have huge consequences, which means your models tend to end
           | up overconfident.
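            | 
            | A toy illustration of that equivalence (estimating a mean):
            | minimizing squared error and maximizing a Gaussian
            | likelihood pick out the same parameter.
            | 
            |     import numpy as np
            | 
            |     rng = np.random.default_rng(0)
            |     x = rng.normal(3.0, 1.0, size=100)
            | 
            |     grid = np.linspace(0.0, 6.0, 601)
            |     mse = [np.mean((x - m) ** 2) for m in grid]
            |     # Gaussian negative log-likelihood, up to constants
            |     nll = [np.sum(0.5 * (x - m) ** 2) for m in grid]
            | 
            |     # identical minimizer: the sample mean
            |     print(grid[np.argmin(mse)], grid[np.argmin(nll)])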
        
           | nonameiguess wrote:
           | (Some) ML is non-parametric, but there are always some
           | questions you need to be able to answer about your data. At
           | bare minimum, is the generating process ergodic, what is the
           | error of your measurement procedure, how representative of
           | the true underlying distribution is your sampling procedure?
           | All use of data should start with some exploratory analysis
           | before you ever get to the modeling stage.
           | 
           | Once you have a model, at minimum understand how to tune for
           | the tradeoffs of different types of error and don't naively
           | optimize for pure accuracy. At the obvious extremes, if
           | you're trying to prevent nuclear attack, false negatives are
           | much more costly than false positives, if you're trying to
           | figure out whether to execute someone for murder, false
           | positives are much more costly than false negatives.
           | Understand the relative costs of different types of error for
           | whatever you're trying to predict and proceed accordingly.
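            | 
            | A tiny made-up example of tuning that tradeoff by moving the
            | decision threshold:
            | 
            |     import numpy as np
            | 
            |     scores = np.array([0.1, 0.3, 0.4, 0.55, 0.6, 0.8, 0.9])
            |     labels = np.array([0, 0, 1, 0, 1, 1, 1])  # ground truth
            | 
            |     for thresh in (0.35, 0.7):
            |         pred = scores >= thresh
            |         fp = int(np.sum(pred & (labels == 0)))   # false alarms
            |         fn = int(np.sum(~pred & (labels == 1)))  # misses
            |         print(thresh, "FP:", fp, "FN:", fn)
            | 
            | A low threshold (the nuclear-attack case) buys fewer misses
            | at the cost of more false alarms; a high one (the murder-
            | conviction case) does the reverse.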
        
         | rademacher wrote:
          | The problem is that in high dimensions, knowing the
          | distribution or even characterizing it fully with data is
          | incredibly difficult (curse of dimensionality). I think the
          | real assumption in ML is just that there is some low
          | dimensional space that characterizes the data well, and ML
          | algorithms find these directions where the data is constant.
        
         | iagovar wrote:
          | ML looks (to many people) like a way to circumvent your grumpy
          | statistician saying that the underlying data is worthless
          | and/or that you should focus on getting the data pipeline done
          | properly for a logit model on your churn rate.
        
           | analog31 wrote:
           | "Scientist free science," -- being able to optimize systems
           | without understanding them, has been a dream of the business
           | world since the dawn of time. There's always been a market
           | for cookbook recipes that automate the collection of data,
           | and interpretation of results. Before ML, there were "design
           | of experiments," and "statistical quality control."
        
             | carlmr wrote:
             | >Before ML, there were "design of experiments," and
             | "statistical quality control."
             | 
             | Statistical quality control, at least the way I know it, is
             | very useful in finding problems in your process. I'm also
             | not sure how this fits with your premise. It's about
             | optimizing systems by first finding out where to look, and
             | then looking there in detail with expert knowledge, i.e.
             | deep understanding of your system.
        
               | analog31 wrote:
               | I'm definitely with you there, but I've also seen the
               | side of it where it turns into a cargo cult and runs
               | headlong into the replication crisis.
               | 
               | Perhaps the good thing is that as the new things gain
               | popular attention, the old techniques such as SPC are
               | under less pressure to support success theater, and
                | revert to being actually useful, solid tools.
        
         | cambalache wrote:
         | > You know the distribution of the phenomenon under study
         | 
          | If you know the distribution of the phenomenon under study you
          | don't need ML; that is what probability is for.
          | 
          | > or make an explicit assumption and assume the risk of being
          | wrong
          | 
          | No. You have the bias/variance tradeoff here. You can make an
          | explicit assumption about your model or not.
         | 
         | > Using (1), you calculate how much data you need so you get an
         | estimation error below x%
         | 
          | This is extremely complicated for anything except the most
          | trivial toy examples, probably not solvable at all, and
          | definitely not the way biological intelligent systems (aka
          | some humans) do it.
        
         | CabSauce wrote:
          | Wait until you find out how many studies have been published in
         | medical journals with serious statistical flaws.
        
         | sfink wrote:
         | Personally, I think the main problem with ML is simpler: it
         | works well for interpolation, and is crap for extrapolation.
         | 
         | If the outputs you want are well within the bounds of your
         | training data set, ML can do wonders. If they aren't, it'll
         | tell you that in 20 years everyone will be having -0.2 children
         | and all the other species on the planet will start having to
         | birth human babies just so they can be thrown into the smoking
         | pit of bad statistical analysis.
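          | 
          | A minimal sketch of that failure mode (a flexible polynomial
          | standing in for any flexible model):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(1)
          |     x = np.linspace(0.0, 1.0, 50)
          |     y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)
          | 
          |     coefs = np.polyfit(x, y, deg=9)  # fits the data well
          |     print(np.polyval(coefs, 0.5))    # interpolation: ~sin(pi)=0
          |     print(np.polyval(coefs, 2.0))    # extrapolation: wildly off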
        
           | carlosf wrote:
           | I agree, but that's equivalent to my original claim.
           | 
            | Being bad at extrapolation is a consequence of assuming your
            | training data can describe your phenomenon's distribution,
            | and being wrong.
        
         | kvathupo wrote:
         | As currymj commented, this isn't accurate for ML, only for
         | classical statistics.
         | 
         | In ML (or more specifically deep learning), we make no
         | distribution-based assumptions, other than the fundamental
         | assumption that our training data is "distributed like" our
         | test data. Thus, there aren't issues with fat-tailed
         | distributions since we make no such normality assumptions.
         | Indeed, with the use of autoencoders, we don't assume a _single
          | distribution_, but rather a _stochastic process_.
         | 
         | I suppose you could say statistics is less "empirical" than ML
         | in the sense that it is axiom-based, whether that is a
         | normality assumption of predictions about a regression line or
         | stock prices following a Wiener process. By contrast, ML is
         | less rationalist by simply reflecting data.
        
           | peytn wrote:
            | I dunno, there are definitely distribution-based assumptions
            | -- good luck working with skewed data. Most old-school
           | techniques are kinda additive, so nobody's really been
           | assuming a _single distribution_ for practical applications.
           | 
           | Current ML techniques just work well for the kinds of
           | problems people are applying them to, which is kind of a
           | tautology. We should definitely seek to understand the theory
           | behind stuff like dropout and not consider our lack of
           | understanding a strength.
        
           | clircle wrote:
           | The only reason that this may not be accurate for ML is
           | because machine learners generally make no attempt to
           | quantify their uncertainty in their predictions with e.g.
           | confidence intervals or prediction intervals.
           | 
           | And there is a whole field of non-parametric statistics that
           | doesn't make distribution assumptions.
        
           | _dps wrote:
            | It is absolutely untrue that DL is immune to fat-tail
            | problems, and it is important that no one operate mission-
            | critical systems under this assumption.
           | 
           | The two fat tail questions one has to engage are:
           | 
           | - is it possible that a catastrophic input might be lurking
           | in the wild that would not be present in a typical training
           | set? Even with a 1M instance training set, a one-in-a-million
           | situation will only appear (and affect your objective
           | function) on average one time, and could very well not appear
           | at all.
           | 
           | - can I bound how badly I will suffer if my system is allowed
           | to operate in the wild on such an input?
           | 
           | DL gives no additional tools to engage these questions.
        
             | kvathupo wrote:
             | I don't quite follow: is not what you described a flaw
             | fundamental to all forecasting; that is, the occurrence of
             | a gross outlier? I should clarify that DL doesn't suffer
                | from the same problem the normality condition has with
                | fat tails: a failure to capture the skew of the
                | distribution.
        
               | _dps wrote:
               | It's not characteristic of all forecasting, only purely
               | empirical forecasting.
               | 
               | Definitionally, the only way to reason about risk that
               | doesn't appear in training data is non-empirical (e.g. a
               | priori assumptions about distributions, or worst cases,
               | or out-of-paradigm tools like refusing to provide
               | predictions for highly non-central inputs).
               | 
               | DL is not any better (or worse) than any other purely
               | empirical method at answering questions about fat-tail
               | risk, and the only way to do better is to use non-
               | empirical/a-priori tools. Obviously the tradeoff here is
               | that your a priori assumptions can be wrong, and that too
               | needs to be included in your risk model (see e.g. Robust
               | Optimization / Robust Control).
        
               | sjburt wrote:
               | I think it's wrong to assume that non-empirical methods
               | can be reliably trusted to give better results. Humans
               | are terrible at avoiding bias or evaluating risks,
               | especially for uncommon events.
        
             | godelski wrote:
              | > It is absolutely untrue that DL is immune to fat-tail
             | problems
             | 
             | In fact, working on fat tail problems is currently a hot
             | topic in ML.
        
           | fractionalhare wrote:
           | _> In ML (or more specifically deep learning), we make no
           | distribution-based assumptions, other than the fundamental
           | assumption that our training data is  "distributed like" our
           | test data._
           | 
           | Okay, so that's about the same as classical statistics.
           | You're just waiving the requirement to know _what_ the
           | distribution is. You are still assuming there exists a
           | distribution and that it _holds_ in the future when you apply
           | the model. Sure you may not be trying to estimate parameters
           | of a distribution, but it is still there and all standard
           | statistical caveats still apply.
           | 
            | _> Indeed, with the use of autoencoders, we don't assume a
           | single distribution, but rather a stochastic process._
           | 
           | Classical statistics frequently makes use of multiple
              | distributions and stochastic processes.
        
             | potatoman22 wrote:
              | Of course there's a distribution behind the data. The
              | parent commenter was saying not all machine learning
              | techniques need to know that distribution, as a rebuttal
              | to their parent comment.
        
               | fractionalhare wrote:
               | I know what they're saying, I even reiterate it in my
               | second sentence. My point is that doesn't protect you
               | from the distribution changing, which is a problem that
               | applies to machine learning and classical statistics.
               | 
               | This is in support of the GP comment: while you can
               | loosen your assumptions about what the underlying
               | distribution is and don't literally need to know it, you
               | can't get away from the fundamental limitations of
               | statistics. Which is the original topic we're talking
               | about.
        
           | mochomocha wrote:
           | I agree that ML tends to put weaker assumptions on the data
           | than classical statistics and that it's a good thing.
           | 
            | However most ML certainly makes distributional assumptions -
            | they are just weaker. When you're learning a huge deep net
            | with an L2 loss on a regression task, you have a parametric
            | conditional Gaussian distribution under the hood. Being
            | overparametrized doesn't mean there is no distributional
            | assumption. Vanilla autoencoders also work under a
            | multivariate Gaussian setup. Most classifiers are trained
            | under a multinomial distribution assumption, etc.
           | 
           | And fat-tailed distributions are definitely a thing. It's
           | just less of a concern for the mainstream CV problems on
           | which people apply DL.
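            | 
            | A small sketch of the L2 point: the squared-error loss is a
            | conditional Gaussian negative log-likelihood up to additive
            | constants.
            | 
            |     import numpy as np
            | 
            |     resid = np.array([0.5, -1.2, 0.3])  # y - model(x), made up
            |     sigma = 1.0
            | 
            |     l2 = np.sum(resid ** 2) / 2
            |     nll = np.sum(0.5 * np.log(2 * np.pi * sigma ** 2)
            |                  + resid ** 2 / (2 * sigma ** 2))
            |     const = resid.size * 0.5 * np.log(2 * np.pi * sigma ** 2)
            |     print(l2, nll - const)  # equal: same minimizer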
        
           | dumb1224 wrote:
           | > I suppose you could say statistics is less "empirical" than
           | ML in the sense that it is axiom-based, whether that is a
           | normality assumption of predictions about a regression line
           | or stock prices following a Wiener process. By contrast, ML
           | is less rationalist by simply reflecting data.
           | 
            | I don't think that's true (or maybe I misunderstood?). I
            | guess your comment "simply reflecting data" means fitting
            | data with a very flexible function (curve)? There are very
            | flexible distributions that fit almost any kind of data,
            | e.g. https://en.wikipedia.org/wiki/Gamma_distribution, or a
            | composition of them, but as a practitioner you still need to
            | interpret the model and check that it represents the
            | underlying process well. Both statistical inference and ML
            | are getting there using different methods.
        
         | boilerupnc wrote:
         | [Disclosure: I'm an IBMer - not involved with this work]
         | 
         | With regard to exploitation, IBM research has done some
         | interesting work in the form of an open source "Adversarial
         | Robustness Toolbox" [0]. "The open source Adversarial
         | Robustness Toolbox provides tools that enable developers and
         | researchers to evaluate and defend machine learning models and
         | applications against the adversarial threats of evasion,
         | poisoning, extraction, and inference."
         | 
         | It's fascinating to think through how to design the 2nd and 3rd
         | order side-effects using targeted data poisoning to achieve a
         | specific outcome. Interestingly, poisoning could be to force a
         | specific outcome for a one-time gain (e.g. feed data in a way
         | to ultimately trigger an action that elicits some gain/harm) or
          | to alter the outcomes over a longer time horizon (e.g. teach
          | the bot to behave in a socially unacceptable way).
         | 
         | [0] https://art360.mybluemix.net/
        
         | bigbillheck wrote:
         | > 1 - You know the distribution of the phenomenon under study
         | (or make an explicit assumption and assume the risk of being
         | wrong)
         | 
         | Nonparametric methods say 'hi'.
        
         | astrophysician wrote:
          | I agree -- as ML becomes increasingly easy for non-experts or
          | people without a heavy math/stats background to apply, I've
          | seen an increasing volume of arguments against the data
          | science profession (someone the other day called DS the "gate-
          | keepers"). But: there be dragons.
         | 
         | Anyone can use SOTA deep learning models today, but in my
         | experience, it's more important to understand the answer to
         | "what are the shortcomings/consequences of using a particular
          | method to solve this problem?", "what are (or could be) the biases in
         | this dataset?", etc. It requires a non-trivial understanding of
         | the underlying methodology and statistics to reliably answer
         | these questions (or at least worry about them).
         | 
         | Can you apply deep reinforcement learning to your problem?
         | Maybe. Should you? Well, it depends, and you should understand
         | the pros and cons, which requires more than just the knowledge
         | of how to make API calls. There are consequences to misusing
         | ML/AI, and they may not even be obvious from offline testing
         | and cross validation.
        
       ___________________________________________________________________
       (page generated 2021-03-31 23:01 UTC)