[HN Gopher] It Takes Long to Become Gaussian
___________________________________________________________________
It Takes Long to Become Gaussian
Author : kqr
Score : 43 points
Date : 2023-07-20 05:36 UTC (17 hours ago)
(HTM) web link (two-wrongs.com)
(TXT) w3m dump (two-wrongs.com)
| jerf wrote:
| Very important article. Software engineering also has a lot of
| fat-tailed distributions that can masquerade as normal
| distributions. Our intuition tends to assume normal distributions
| to one degree or another and we must learn when to disregard it.
|
| One example that is both very important to software engineers and
| will hit close to home is effort estimations. It is a super fat-
| tailed distribution. The problem is that a lot of tools managers
| will use to try to make schedules implicitly assume that errors
| will essentially be Gaussian. I can draw this fancy Gantt chart
| because OK, sure, task 26 may be a couple of days late, but maybe
| task 24 will be a couple of days early. As we all know from real
| life, tasks are effectively _never_ early, and it's completely
| expected that at least one task from any non-trivial project will
| be budgeted for 2 weeks in the early planning stage but actually
| take 4 months. Managers are driven batty by this because
| intuitively this feels like a 7 or 8 sigma event... but in real
| life, it isn't, it's so predictable that I just predicted it in
| your project, sight unseen, even though I know nothing about your
| project other than the fact it is "nontrivial".
|
| But there's a ton of places in software engineering, especially
| at any kind of scale, where you must learn to discard Gaussian
| assumptions and deal with fatter tailed distributions. I get a
| lot of mileage out of the ideas in this essay. It's hard to give
| a really concrete guide to how to do that, but it's a very
| valuable skill if you can figure out how to learn it.
| angarg12 wrote:
| It might sound silly when spoken out loud, but I found the
| following example quite enlightening about this phenomenon:
| imagine that you want to drive to the airport, and you need to
| estimate how long it will take you to get there. You could simply
| compute the driving time by dividing the distance by your
| average speed.
|
| If you actually drove there many times and measured your trip
| length, most of the time you would arrive later than estimated,
| sometimes horribly so.
|
| Why is that? The explanation is actually trivial: no matter how
| well things work, there is an absolute minimum time required to
| get to the airport. On the other hand, there are many things
| that might delay your trip: heavy traffic, closed roads,
| accidents...
|
| So, on the aggregate, there are very few things that can "go
| well", and they will reduce your trip time only a bit. On the
| other hand there are many things that can go wrong and make
| your trip last much, much longer.
|
| That's a sort of intuitive explanation of why things like
| software estimates are fat-tailed instead of normal.
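The one-sided mechanism described above can be sketched as a small simulation. This is only an illustration: the 30-minute best case, the delay probabilities, and the exponential delay sizes are all made-up assumptions, not measurements.

```python
import random
import statistics

BASE_MINUTES = 30.0  # hypothetical best-case drive: distance / average speed

# Hypothetical rare delay sources: (probability per trip, mean delay in minutes).
RARE_DELAYS = [(0.10, 15.0), (0.02, 60.0)]

def simulate_trip(rng):
    """Return one trip time: the hard minimum plus only non-negative delays."""
    total = BASE_MINUTES
    total += rng.expovariate(1.0 / 3.0)  # routine congestion, mean 3 minutes
    for prob, mean_delay in RARE_DELAYS:
        if rng.random() < prob:
            total += rng.expovariate(1.0 / mean_delay)
    return total

rng = random.Random(0)
trips = [simulate_trip(rng) for _ in range(10_000)]

mean_trip = statistics.fmean(trips)
median_trip = statistics.median(trips)
late_share = sum(t > BASE_MINUTES for t in trips) / len(trips)

print(f"estimate {BASE_MINUTES:.1f}  median {median_trip:.1f}  "
      f"mean {mean_trip:.1f}  worst {max(trips):.1f}")
print(f"share of trips slower than the estimate: {late_share:.0%}")
```

Because nothing in the model can push a trip below the minimum, every deviation lands on the slow side, so the mean sits above the median and the worst trips are far from both.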
| taeric wrote:
| Ironically, that anecdote also paints a good reason to sample
| the data on how long it takes to get there. If you check your
| own history for how long it takes to get somewhere, it is
| probably far more predictable than you'd expect.
|
| Of course, this also explains why software estimates are
| hard. We are all too often estimating things we haven't done
| before.
| prometheus76 wrote:
| I have been working in custom steel fabrication for most of my
| career, and our distributions are all fat-tail distributions.
| The main trick we've learned is to focus on and obsess over
| outliers at every stage of moving through our shop. If
| something has been sitting for more than three days, it's
| brought up in every meeting until it's moving again.
|
| The hardest part to communicate to accountant types and
| management types (especially if they come to our company from a
| manufacturing background) is that our system is and always will
| be a chaotic system, and therefore inherently unpredictable.
|
| The reason we are a chaotic system is because the base element
| in our system is a person, and people have inherently chaotic
| productivity output.
|
| Another major contributor to the chaos is the Pareto
| distribution of productivity among workers. When scheduling
| plans a task to take 10 hours, what they can't account for is
| that if employee A does it, it will take 4 hours, and if
| employee B does it, it will take 25 hours.
|
| I could go on and on with other layers of complexity that
| create long fat-tail distributions for us, but you get the
| point.
| cozzyd wrote:
| And be especially careful if your distribution is bounded and you
| are somewhat near the bound.... (Gammas come in handy)
| nequo wrote:
| This, along with the conclusion, is wrong in the first example:
|
| > N(mu = 30 x 0.2, sigma = sqrt(30) x 0.9). With these parameters,
| 16 MB should have a z score of about 2, and my computer (using
| slightly more precise numbers) helpfully tells me it's 2.38, which
| under the normal distribution corresponds to a probability of
| 0.8 %. [...] The last point is the most important one for the
| question we set out with. The actual risk of the files not
| fitting is 6x higher than the central limit theorem would have us
| believe.
|
| The probability implied by the normal distribution is 2.1%, not
| 0.8%:
|
|   > 1 - pnorm(16, 30 * 0.2, sqrt(30) * 0.9)
|   [1] 0.02124942
|
| So the actual probability of the files not fitting is only 2.26
| times higher than the implied probability. I think this is actually
| quite impressive. At n = 30, we are talking about a very small
| sample, one for which we don't usually use the central limit
| theorem.
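The same check can be reproduced without R. The sketch below assumes the quoted parameters are per-file mean and standard deviation in MB, and it infers the article's empirical risk as 6 x 0.8 % from the quoted text; neither is stated explicitly in the thread.

```python
import math

def normal_upper_tail(x, mean, sd):
    """P(X > x) for X ~ Normal(mean, sd), via the complementary error function."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))

n_files = 30
mean_size, sd_size = 0.2, 0.9  # per-file mean and sd, assumed to be in MB
capacity = 16.0                # MB

# Sum of n independent files: the mean scales with n, the sd with sqrt(n).
total_mean = n_files * mean_size
total_sd = math.sqrt(n_files) * sd_size

z = (capacity - total_mean) / total_sd
p_normal = normal_upper_tail(capacity, total_mean, total_sd)

# Assumption: empirical risk inferred from the quoted "6x higher" than 0.8 %.
empirical_risk = 6 * 0.008
ratio = empirical_risk / p_normal

print(f"z = {z:.3f}, normal tail = {p_normal:.4%}, ratio = {ratio:.2f}")
```

The standard deviation of the sum uses sqrt(30), matching the pnorm call above; using 30 instead would give a much smaller z score.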
| thebooktocome wrote:
| To be clear, the recommendation of N=30 came from the LessWrong
| article the author is critiquing, not the author themselves.
| cb321 wrote:
| https://link.springer.com/content/pdf/10.1007/3-540-59222-9_...
| kubeia wrote:
| From the Statistical Consequences of Fat Tails (N. Taleb):
|
| "We will address ad nauseam the central limit theorem but here is
| the initial intuition. It states that n-summed independent random
| variables with finite second moment end up looking like a
| Gaussian distribution. Nice story, but how fast? Power laws on
| paper need an infinity of such summands, meaning they never
| really reach the Gaussian. Chapter 7 deals with the limiting
| distributions and answers the central question: "how fast?" both
| for CLT and LLN. How fast is a big deal because in the real world
| we have something different from n equals infinity."
|
| https://arxiv.org/abs/2001.10488
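A rough way to see the "how fast?" question is to sum a modest number of power-law draws and count how often the sum lands beyond three standard deviations, compared with what a Gaussian would predict. The tail exponent of 3, the 30 summands, and the trial count below are arbitrary assumptions chosen for illustration; this is not a calculation from the book.

```python
import math
import random

ALPHA = 3.0        # hypothetical tail exponent; variance is finite only for ALPHA > 2
N_SUMMANDS = 30
N_TRIALS = 20_000

# Exact mean and variance of a Pareto(ALPHA) variable with minimum value 1.
mean_x = ALPHA / (ALPHA - 1.0)
var_x = ALPHA / ((ALPHA - 1.0) ** 2 * (ALPHA - 2.0))

sum_mean = N_SUMMANDS * mean_x
sum_sd = math.sqrt(N_SUMMANDS * var_x)
threshold = sum_mean + 3.0 * sum_sd

rng = random.Random(1)
exceed = 0
for _ in range(N_TRIALS):
    total = sum(rng.paretovariate(ALPHA) for _ in range(N_SUMMANDS))
    if total > threshold:
        exceed += 1

simulated_tail = exceed / N_TRIALS
normal_tail = 0.5 * math.erfc(3.0 / math.sqrt(2.0))  # P(Z > 3), standard normal

print(f"simulated P(sum > mean + 3 sd) = {simulated_tail:.4f}")
print(f"Gaussian prediction            = {normal_tail:.4f}")
```

Even though the second moment is finite here, so the central limit theorem applies in the limit, 30 summands are not enough for the upper tail to match the Gaussian figure.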
| Bostonian wrote:
| The author's stock market example fails to talk about percentage
| returns and makes no sense to me.
|
| "Will a month of S&P 500 ever drop more than $1000?
|
| Using the same procedure but with data from the S&P 500 index,
| the central limit theorem suggests that a drop of more than
| $1000, when looking at a time horizon of 30 days, happens about
| 0.9 % of the time."
| CrazyStat wrote:
| Using the (log) returns would certainly be the usual way to
| approach this.
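For reference, a minimal sketch of what working in log returns looks like. The index levels are invented numbers, not S&P 500 data.

```python
import math

prices = [4000.0, 4100.0, 3900.0, 4050.0]  # hypothetical index levels, not real data

# Simple return: relative change between consecutive observations.
simple_returns = [b / a - 1.0 for a, b in zip(prices, prices[1:])]

# Log return: log of the price ratio; these add up across periods.
log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]

total_log_return = sum(log_returns)
direct_log_return = math.log(prices[-1] / prices[0])

print("simple:", [f"{r:+.4f}" for r in simple_returns])
print("log:   ", [f"{r:+.4f}" for r in log_returns])
print(f"sum of log returns = {total_log_return:+.6f}")
print(f"log(last / first)  = {direct_log_return:+.6f}")
```

Log returns are additive over time and independent of the index level, whereas a fixed dollar drop means something different at every level of the index.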
| CrazyStat wrote:
| > So here's my radical proposal: instead of mindlessly fitting a
| theoretical distribution onto your data, use the real data. We
| have computers now, and they are really good at performing the
| same operation over and over - exploit that. Resample from the
| data you have!
|
| Hmm.
|
| The problem with fat-tailed distributions is that often the real
| data doesn't actually show you how fat-tailed the distribution
| is. The author notes this later, but doesn't discuss the
| difficulty it creates for the suggestion of "using the real
| data":
|
| > In other words, a heavy-tailed distribution will generate a lot
| of central values and look like a perfect bell curve for a long
| time ... until it doesn't anymore. Annoyingly, the worse the tail
| events, the longer it will keep looking like a friendly bell
| curve.
|
| This is all true, but the author misses the immediate corollary:
| the worse the tail events, the longer using the real data will
| mislead you into thinking the tails are _fine_. Using the real
| data doesn't solve the fat tails problem; in fact, resampling
| from the real data is sampling from a bounded distribution with
| _zero tails_.
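The "zero tails" point can be demonstrated directly: a bootstrap resample can never contain a value larger than the largest one already observed. The sketch below uses a Pareto distribution with an arbitrary tail exponent of 1.5 as a stand-in for "the real data"; the sample sizes are also assumptions.

```python
import random

rng = random.Random(42)
ALPHA = 1.5  # hypothetical tail exponent for a heavy-tailed source

# Pretend this is the data we actually collected.
observed = [rng.paretovariate(ALPHA) for _ in range(1_000)]
observed_max = max(observed)

# Bootstrap: draw with replacement from the observed values only.
resample_max = max(
    max(rng.choices(observed, k=len(observed))) for _ in range(200)
)

# Fresh draws from the true source, which the bootstrap never sees.
fresh = [rng.paretovariate(ALPHA) for _ in range(200_000)]
fresh_exceed = sum(x > observed_max for x in fresh)

print(f"largest observed value:        {observed_max:.1f}")
print(f"largest value in any resample: {resample_max:.1f}")
print(f"fresh draws above observed max: {fresh_exceed} of {len(fresh)}")
```

The resampled maximum is capped at the observed maximum by construction, while the underlying source remains free to produce larger values than anything in the sample.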
___________________________________________________________________
(page generated 2023-07-20 23:02 UTC)