[HN Gopher] It Takes Long to Become Gaussian
       ___________________________________________________________________
        
       It Takes Long to Become Gaussian
        
       Author : kqr
       Score  : 43 points
       Date   : 2023-07-20 05:36 UTC (17 hours ago)
        
 (HTM) web link (two-wrongs.com)
 (TXT) w3m dump (two-wrongs.com)
        
       | jerf wrote:
       | Very important article. Software engineering also has a lot of
       | fat-tailed distributions that can masquerade as normal
       | distributions. Our intuition tends to assume normal distributions
       | to one degree or another and we must learn when to disregard it.
       | 
       | One example that is both very important to software engineers and
       | will hit close to home is effort estimations. It is a super fat-
       | tailed distribution. The problem is that a lot of tools managers
       | will use to try to make schedules implicitly assume that errors
       | will essentially be Gaussian. I can draw this fancy Gantt chart
       | because OK, sure, task 26 may be a couple of days late, but maybe
       | task 24 will be a couple of days early. As we all know from real
        | life, tasks are effectively _never_ early, and it's completely
       | expected that at least one task from any non-trivial project will
       | be budgeted for 2 weeks in the early planning stage but actually
       | take 4 months. Managers are driven batty by this because
       | intuitively this feels like a 7 or 8 sigma event... but in real
       | life, it isn't, it's so predictable that I just predicted it in
       | your project, sight unseen, even though I know nothing about your
       | project other than the fact it is "nontrivial".
       | 
        | But there are a ton of places in software engineering,
        | especially at any kind of scale, where you must learn to
        | discard Gaussian assumptions and deal with fatter-tailed
        | distributions. I get a
       | lot of mileage out of the ideas in this essay. It's hard to give
       | a really concrete guide to how to do that, but it's a very
       | valuable skill if you can figure out how to learn it.
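The schedule-slip effect described above is easy to see in a toy Monte Carlo. This is a sketch with invented numbers (20 tasks, 5-day estimates, a lognormal spread chosen arbitrarily), not anything from the comment: each task's *median* duration equals its estimate, tasks are rarely early, and a few blow up.

```python
import math
import random

random.seed(42)

# Toy model (numbers invented for illustration): a project of 20 tasks,
# each estimated at 5 days. Actual durations are lognormal, so a task
# is almost never much early but occasionally blows up badly.
N_TASKS = 20
ESTIMATE = 5.0   # days per task; also the median of the lognormal
SIGMA = 0.8      # lognormal spread, chosen arbitrarily
TRIALS = 20_000

def project_duration():
    return sum(random.lognormvariate(math.log(ESTIMATE), SIGMA)
               for _ in range(N_TASKS))

budget = N_TASKS * ESTIMATE          # the 100-day "Gantt chart" plan
totals = [project_duration() for _ in range(TRIALS)]
over_budget = sum(t > budget for t in totals) / TRIALS

# Even though every task's median duration equals its estimate, the fat
# right tail makes the project late in the large majority of runs.
print(f"P(project exceeds the naive plan) = {over_budget:.2f}")
```

Errors here do not cancel the way a Gaussian Gantt model assumes: overruns are unbounded while underruns are capped near zero.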
        
         | angarg12 wrote:
         | It might sound silly when spoken out loud, but I found the
         | following example quite enlightening about this phenomenon:
          | imagine that you want to drive to the airport, and you need
          | to estimate how long it will take you to get there. You can
          | produce a naive estimate by dividing the distance by your
          | average speed.
         | 
          | If you actually drove there many times and measured your
          | trip length, most of the time you would arrive later than
          | that estimate, sometimes horribly so.
         | 
         | Why is that? The explanation is actually trivial: no matter how
         | well things work, there is an absolute minimum time required to
         | get to the airport. On the other hand, there are many things
         | that might delay your trip: heavy traffic, closed roads,
         | accidents...
         | 
          | So, in the aggregate, there are very few things that can "go
         | well", and they will reduce your trip time only a bit. On the
         | other hand there are many things that can go wrong and make
         | your trip last much, much longer.
         | 
         | That's a sort of intuitive explanation of why things like
         | software estimates are fat-tailed instead of normal.
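The airport example can be simulated directly. This is a minimal sketch with made-up probabilities and delays: a hard lower bound on the drive, plus a few independent mishaps that only ever add time.

```python
import random

random.seed(0)

# Hypothetical trip model: the drive has a hard minimum of 30 minutes,
# and each mishap can only *add* time. Probabilities and delay sizes
# are made up for illustration.
BASE_MINUTES = 30.0
MISHAPS = [
    (0.20, 15.0),   # heavy traffic
    (0.05, 30.0),   # closed road, long detour
    (0.01, 90.0),   # accident ahead
]

def trip_minutes():
    t = BASE_MINUTES
    for prob, delay in MISHAPS:
        if random.random() < prob:
            t += delay
    return t

trips = sorted(trip_minutes() for _ in range(100_000))
median = trips[len(trips) // 2]
mean = sum(trips) / len(trips)

# The typical (median) trip sits right at the lower bound, while the
# mean is dragged upward by rare long delays: a fat right tail.
print(f"median {median:.0f} min, mean {mean:.1f} min, "
      f"worst {trips[-1]:.0f} min")
```

The asymmetry is the whole story: nothing can push the trip below 30 minutes, but several things can push it far above.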
        
           | taeric wrote:
           | Ironically, that anecdote also paints a good reason to sample
           | the data on how long it takes to get there. If you check your
           | own history for how long it takes to get somewhere, it is
           | probably far more predictable than you'd expect.
           | 
           | Of course, this also explains why software estimates are
           | hard. We are all too often estimating things we haven't done
           | before.
        
         | prometheus76 wrote:
         | I have been working in custom steel fabrication for most of my
         | career, and our distributions are all fat-tail distributions.
         | The main trick we've learned is to focus on and obsess over
         | outliers at every stage of moving through our shop. If
         | something has been sitting for more than three days, it's
         | brought up in every meeting until it's moving again.
         | 
         | The hardest part to communicate to accountant types and
         | management types (especially if they come to our company from a
         | manufacturing background) is that our system is and always will
         | be a chaotic system, and therefore inherently unpredictable.
         | 
         | The reason we are a chaotic system is because the base element
         | in our system is a person, and people have inherently chaotic
         | productivity output.
         | 
         | Another major contributor to the chaos is the Pareto
         | distribution among workers of productivity. When scheduling
         | plans a task to take 10 hours, what they can't account for is
         | that if employee A does it, it will take 4 hours, and if
         | employee B does it, it will take 25 hours.
         | 
         | I could go on and on with other layers of complexity that
         | create long fat-tail distributions for us, but you get the
         | point.
        
       | cozzyd wrote:
       | And be especially careful if your distribution is bounded and you
       | are somewhat near the bound.... (Gammas come in handy)
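A quick illustration of the bounded-near-the-bound problem, reusing the article's per-file numbers (mean 0.2 MB, sd 0.9 MB) purely as example values: file sizes are bounded below at zero, and a normal with these moments puts substantial mass below the bound, while a moment-matched gamma respects it.

```python
import random

random.seed(3)

# Per-file mean and sd from the article's example; the mean sits well
# within one sd of the lower bound at zero.
mean, sd = 0.2, 0.9
shape = (mean / sd) ** 2   # gamma moment matching: shape = mean^2 / var
scale = sd ** 2 / mean     # scale = var / mean

normal_draws = [random.gauss(mean, sd) for _ in range(100_000)]
gamma_draws = [random.gammavariate(shape, scale) for _ in range(100_000)]

frac_negative = sum(x < 0 for x in normal_draws) / len(normal_draws)
print(f"normal puts {frac_negative:.0%} of its mass below zero")
assert all(x >= 0 for x in gamma_draws)   # gamma never goes negative
```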
        
       | nequo wrote:
        | This, along with the conclusion, is wrong in the first
        | example:
        | 
        | > N(m=30x0.2,s=30x0.9). With these parameters, 16 MB should
        | > have a z score of about 2, and my computer (using slightly
        | > more precise numbers) helpfully tells me it's 2.38, which
        | > under the normal distribution corresponds to a probability
        | > of 0.8 %. [...] The last point is the most important one
        | > for the question we set out with. The actual risk of the
        | > files not fitting is 6x higher than the central limit
        | > theorem would have us believe.
       | 
       | The probability implied by the normal distribution is 2.1%, not
       | 0.8%:                 > 1 - pnorm(16, 30 * 0.2, sqrt(30) * 0.9)
       | [1] 0.02124942
       | 
        | So the actual probability of the files not fitting is only
        | 2.26 times higher than the implied probability, not 6x. I
        | think this is actually quite impressive. At n = 30, we are
        | talking about a very small sample, one for which we don't
        | usually use the central limit theorem.
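For reference, the quoted R result can be reproduced with nothing but the Python standard library, using the complementary error function for the normal survival function:

```python
import math

# Reproducing the R call `1 - pnorm(16, 30 * 0.2, sqrt(30) * 0.9)`.
def norm_sf(x, mean, sd):
    """P(X > x) for X ~ Normal(mean, sd)."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2)))

mean = 30 * 0.2             # expected total size of 30 files, in MB
sd = math.sqrt(30) * 0.9    # CLT: the sd of a sum grows with sqrt(n)
p = norm_sf(16, mean, sd)
print(f"P(total > 16 MB) = {p:.4f}")   # 0.0212, i.e. about 2.1 %
```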
        
         | thebooktocome wrote:
         | To be clear, the recommendation of N=30 came from the LessWrong
         | article the author is critiquing, not the author themselves.
        
       | cb321 wrote:
       | https://link.springer.com/content/pdf/10.1007/3-540-59222-9_...
        
       | kubeia wrote:
       | From the Statistical Consequences of Fat Tails (N. Taleb):
       | 
       | "We will address ad nauseam the central limit theorem but here is
       | the initial intuition. It states that n-summed independent random
       | variables with finite second moment end up looking like a
       | Gaussian distribution. Nice story, but how fast? Power laws on
       | paper need an infinity of such summands, meaning they never
       | really reach the Gaussian. Chapter 7 deals with the limiting
       | distributions and answers the central question: "how fast?" both
       | for CLT and LLN. How fast is a big deal because in the real world
       | we have something different from n equals infinity."
       | 
       | https://arxiv.org/abs/2001.10488
        
       | Bostonian wrote:
       | The author's stock market example fails to talk about percentage
       | returns and makes no sense to me.
       | 
       | "Will a month of S&P 500 ever drop more than $1000?
       | 
       | Using the same procedure but with data from the S&P 500 index,
       | the central limit theorem suggests that a drop of more than
       | $1000, when looking at a time horizon of 30 days, happens about
       | 0.9 % of the time."
        
         | CrazyStat wrote:
         | Using the (log) returns would certainly be the usual way to
         | approach this.
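A minimal sketch of what that means, with hypothetical index levels: log returns are scale-free (a $1000 move means very different things at index level 1000 versus 4000) and they add up across days, which is exactly what a 30-day CLT argument needs.

```python
import math

# Hypothetical closing prices, made up for illustration.
prices = [4100.0, 4150.0, 4080.0, 4120.0, 4090.0]
log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]

# Additivity: daily log returns sum to the log return of the whole span.
total = math.log(prices[-1] / prices[0])
assert abs(sum(log_returns) - total) < 1e-12
print([f"{r:+.4f}" for r in log_returns])
```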
        
       | CrazyStat wrote:
       | > So here's my radical proposal: instead of mindlessly fitting a
       | theoretical distribution onto your data, use the real data. We
       | have computers now, and they are really good at performing the
       | same operation over and over - exploit that. Resample from the
       | data you have!
       | 
       | Hmm.
       | 
       | The problem with fat-tailed distributions is that often the real
       | data doesn't actually show you how fat-tailed the distribution
       | is. The author notes this later, but doesn't discuss the
       | difficulty it creates for the suggestion of "using the real
       | data":
       | 
       | > In other words, a heavy-tailed distribution will generate a lot
       | of central values and look like a perfect bell curve for a long
       | time ... until it doesn't anymore. Annoyingly, the worse the tail
       | events, the longer it will keep looking like a friendly bell
       | curve.
       | 
       | This is all true, but the author misses the immediate corollary:
       | the worse the tail events, the longer using the real data will
        | mislead you into thinking the tails are _fine_. Using the
        | real data doesn't solve the fat-tails problem; in fact,
        | resampling from the real data is sampling from a bounded
        | distribution with _zero tails_.
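That corollary is easy to demonstrate. A sketch using a Pareto distribution chosen purely for illustration: a bootstrap resample can never exceed the largest value in the sample, so it assigns probability zero to exactly the tail events the true distribution will eventually produce.

```python
import random

random.seed(1)

# Fat-tailed toy distribution: Pareto with tail index 1.5 (infinite
# variance), sampled by inverse CDF. P(X > x) = x**(-ALPHA) for x >= 1.
ALPHA = 1.5

def pareto():
    # 1 - random() lies in (0, 1], avoiding a division-by-zero at u = 0.
    return (1.0 - random.random()) ** (-1.0 / ALPHA)

sample = [pareto() for _ in range(1_000)]
observed_max = max(sample)

# A bootstrap resample is drawn from the observed values, so it can
# never exceed the largest value we happened to see.
resample_max = max(random.choice(sample) for _ in range(1_000_000))
assert resample_max <= observed_max

# The true distribution, however, keeps producing new records: this is
# the exact probability of beating the observed maximum on one draw.
true_prob_beyond = observed_max ** (-ALPHA)
print(f"observed max {observed_max:.1f}; "
      f"true P(X > max) = {true_prob_beyond:.3%}, bootstrap says 0 %")
```

The bootstrap is excellent at reproducing the bulk of a distribution; it is structurally incapable of extrapolating beyond the data, which is where fat-tailed risk lives.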
        
       ___________________________________________________________________
       (page generated 2023-07-20 23:02 UTC)