Subj : Essay: Introduction to "Probability and Statistics"
To : comp.programming,sci.math
From : rem642b
Date : Wed Jul 06 2005 12:00 am

Unlike most of mathematics, where we merely compute with numbers, or plan how numbers might be computed, or figure out patterns of numerical computations, or set up abstract structures and analyze their characteristics, all of which requires nothing more than a strongly disciplined mind and a desire to understand complex but totally consistent systems, probability and statistics requires a special mindset, a special way of thinking about nature. (Albert Einstein is famous for not having this mindset. I hope you all know his famous quote regarding quantum mechanics and metaphorical "dice"?)

When studying probability and statistics, we must accept that we don't know ahead of time how experiments are going to turn out, at least not exactly, and even in principle there's no way to know absolutely for sure how an experiment will turn out. For experiments with a finite set of possible outcomes (I won't discuss infinite cases here in this absolute-beginner introduction) we have a list of possible outcomes of the experiment, but we don't know which of the various outcomes will actually occur when we perform the experiment. Of course after the experiment has been performed and we've observed the result, presumably at that time we know precisely which of the various outcomes really did occur. But before the experiment was performed, all we could know for sure was that each of the outcomes on our list was possible, i.e. it was impossible to cross any of them off the list as impossible. Anything impossible was never on the list in the first place, or we realized our mistake and crossed it off long before doing the experiment. (It's possible that something is truly impossible in nature but we don't *know* it's impossible, so we leave it on our list anyway. I might have time to discuss this topic later.)

Before the experiment, besides the list of what's possible as an outcome, we think we have an idea about something else, namely how often each of the different outcomes would occur if we repeated the experiment many times. For example, if we flip a coin once, it'll come up heads, or it'll come up tails, one or the other, not half and half or any other middle ground between the two extremes. But if we flip a coin many times, or flip many coins once each, etc., we rather expect about half of the outcomes will be heads and about half tails. (In truth, even with an exactly balanced coin, this estimate is slightly wrong. A better prediction would include a third outcome, that something might prevent either heads or tails from coming up. For example, a bird might fly by and gulp the coin while it's still in the air. Or the coin might fall in a slit in a grate over a storm drain and be lost from our sight. Or the coin might land on its edge and roll down the sidewalk far away where we can't find it. Or the coin might land on its edge and just stay there indefinitely, and we declare neither heads nor tails rather than wait hours or days for a breeze to finally topple it one way or the other.
But with a properly balanced coin, without any storm drain openings nearby, on level ground or inside (but not over a couch/sofa where we all know coins tend to fall in the crack and get lost for years, and not over a sink or toilet or the cat's litter box etc.), away from any large congregation of birds or other animals that might snatch the coin, and over a hard non-sticky surface where coins tend to fall over rather than stick into wet mud etc., and away from any number of other problem situations that I haven't thought to include here, the half-heads half-tails estimate is extremely close to a correct prediction of future experiments.)

The study of how best to make such a prediction about the outcome before a single experiment, and/or estimating the long-term average of a series of identical experiments, and/or performing mathematical calculations to compute an estimate for a combination of experiments considered as a single outcome (*1*), is what is known as "Probability", and each of the predictions as to the long-term repeat frequency of a single outcome is called the "probability of that outcome". The analysis of long-term averages after many experiments have already been performed and the data collected is called "Statistics", and the dataset consisting of summary totals or fractions of any given series of experiments is called the "statistics of that series of experiments".

*1* (For example, if you flip two identical coins, under the assumption that the probability for each coin by itself is heads 1/2, tails 1/2, with nothing causing the outcome of either coin to affect the other coin's outcome, what are the possible combination outcomes, and how often do you expect them to occur if you perform the flip-two-coins experiment many times? Answer: two-heads 1/4, head+tail 1/2, two-tails 1/4.)

How are Probability and Statistics related? A priori estimates of probabilities can be used to predict the statistics that will later be generated from a series of experiments. If the statistics differ greatly from the a priori prediction, that tends to indicate that the a priori probabilities were incorrect, and that we should modify our a priori model to do a better job of predicting the statistics of future experiments. There are powerful statistical tests that are supposed to give us good advice as to whether to immediately try to modify our old theory, or whether the theory is "close enough" to correct and we should keep it as-is until and unless there's better evidence that something is wrong with it. The most famous of these statistical tests of how well the statistics from a series of experiments match ("fit" in Statistics jargon) the a priori model is called "Chi Squared". But in 1969 or 1970 I was very uncomfortable with the task of looking at the "Chi Squared" number and trying to figure out whether it was good or bad or whatnot, so I worked out an alternate statistical test that I find much easier to understand and interpret, at least for experiments where the outcomes of a series of experiments can be added together and then divided by the total number of individual experiments to produce a single overall average, which is an agglomerate outcome of the whole series of experiments. (This method is applicable whenever there are only two outcomes, such as coin flipping, or computer pseudo-random bit generators, because you can always treat one outcome as 0 and the other as 1 and then compute the total and average for any collection of experiments.)
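(Aside: here is a minimal simulation sketch of the two-coin example and the Chi Squared comparison, written in modern Python purely for illustration; it is not the original FORTRAN, and the function names and the use of Python's built-in random module are just my choices for the demo.)

  # Flip two fair coins many times, tally the combined outcomes, and compare
  # the observed counts to the predicted fractions 1/4, 1/2, 1/4.  A plain
  # Chi Squared statistic is computed by hand; it stays small when the model fits.
  import random

  def two_coin_experiment(n_trials=100000, seed=1):
      rng = random.Random(seed)
      counts = {"two-heads": 0, "head+tail": 0, "two-tails": 0}
      for _ in range(n_trials):
          heads = rng.randint(0, 1) + rng.randint(0, 1)  # number of heads this trial
          if heads == 2:
              counts["two-heads"] += 1
          elif heads == 1:
              counts["head+tail"] += 1
          else:
              counts["two-tails"] += 1
      expected = {"two-heads": n_trials / 4,
                  "head+tail": n_trials / 2,
                  "two-tails": n_trials / 4}
      chi_squared = sum((counts[k] - expected[k]) ** 2 / expected[k] for k in counts)
      return counts, chi_squared

  if __name__ == "__main__":
      counts, chi2 = two_coin_experiment()
      print(counts)   # roughly 25000 / 50000 / 25000
      print(chi2)     # typically a few units at most when the 1/4, 1/2, 1/4 model fits

Running it with a biased or correlated pair of coins instead makes the observed counts drift away from 1/4, 1/2, 1/4 and the Chi Squared number grow, which is exactly the "statistics contradict the a priori model" situation described above.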
My new (*2*) test of statistical fit to model is based on "normalizing" the statistics so that the local density (*3*) of the probability-distribution function is constant, so if it were graphed it would be a horizontal line if the model was correct (agreed with reality/nature) and some other non-straight graph if the model was wrong. But then I chose not to try drawing a line graph, which is very difficult to get correct (it means computing the derivative of a function that's discrete in the first place, so the derivative makes no sense really), and is a one-time thing on the CalComp 565 plotter on the computer I was using: there's no way to process some data, draw one preliminary line, then include more data, erase the old line, and draw a new line, like you can on a CRT or LED display; either you spend a lot of time drawing many many successive graphs down the page, wasting lots of paper, or you wait until the very end before you know whether the program worked at all, but then you don't know how long to run the program.

Instead I used the CalComp 565 plotter to fill an interval with marks, one for each event, so the visual density of these marks at any abscissa (*4*) matches the local density of the statistics: if the image looks uniformly gray then the fit is good, whereas if it looks lumpy then the fit is bad. This was really nice for the CalComp plotter because I could have the program immediately start drawing, so I'd see one mark, then a second added to it, then a third, etc., seeing the image gradually build up, so I could tell immediately whether the program was working, and in the case of grossly lopsided output from the FORTRAN library I could very quickly see that parts were being filled while parts were being avoided, and after it had run longer I could see a clear image of which parts were dark and which were blank paper.

Actually I plotted several different statistical tests, each independently normalized, stacked on top of each other. The X axis was the abscissa, while the Y axis selected the various tests in strips, and the density was the true ordinate (the value being plotted; see that dictionary entry again). The tests were in a sequence, with the single-bit statistics (1/2 1/2) plotted across the bottom strip, the two-bit combinations (1/4 1/2 1/4) as described earlier in the next-to-bottom strip, etc., upward to the top. I took time out from composing this essay to compose this: http://www.rawbw.com/~rem/AsciiArt/StatisticalTest.txt an AsciiArt diagram of the various strips and the rectangles within each strip.

Anyway, the question arises: within each rectangle, where do I have the CalComp plotter draw the single vertical mark for each sample? I used successive integer multiples of the Golden Ratio, modulo 1, to determine the position (left=0, right=1) within each rectangle to draw the mark.

*2* (Well, it was new in 1970, when I used it to establish firmly that the IBM 1130 FORTRAN random-number generator was utterly horrible as a pseudo-random-bit generator.)

*3* (This gets into the relation between Calculus and statistics of infinite sample spaces over intervals, but I said I wasn't going to talk about infinite sample spaces, so I'm not going to explain precisely what I mean in this essay; if you don't understand the idea even vaguely then please ignore my remarks about my new visual test of statistics/model fit.)

*4* (See any good online dictionary or Analytic Geometry book.
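(Aside: the sketch below is a reconstruction in modern Python, not the 1970 FORTRAN, of how I read the two ingredients just described: it assumes each possible outcome owns a rectangle whose width equals its model probability, so a correct model fills the whole strip at uniform density, and it uses successive multiples of the Golden Ratio modulo 1 to spread the marks for each outcome across its own rectangle. The function names, the ASCII shading, and the 64-column resolution are all just for the demo.)

  # One "strip" of the visual goodness-of-fit test: each possible outcome owns
  # a rectangle whose width equals its model probability, so if the model is
  # right the strip fills with marks at uniform density ("uniformly gray");
  # a wrong model shows up as lumpy dark and light regions.
  import math
  import random

  PHI = (1 + math.sqrt(5)) / 2      # Golden Ratio, used modulo 1 to place marks

  def strip_columns(model, samples, n_columns=64):
      """model: dict outcome -> probability; samples: iterable of outcomes.
      Returns a per-column count of marks across the [0,1) strip."""
      left, rects = 0.0, {}
      for outcome, p in model.items():      # lay out one rectangle per outcome
          rects[outcome] = (left, p)
          left += p
      columns = [0] * n_columns
      seen = {outcome: 0 for outcome in model}
      for outcome in samples:
          seen[outcome] += 1
          lo, width = rects[outcome]
          # Successive multiples of the Golden Ratio, modulo 1, spread the
          # marks for this outcome evenly across its own rectangle.
          x = lo + width * ((seen[outcome] * PHI) % 1.0)
          columns[min(int(x * n_columns), n_columns - 1)] += 1
      return columns

  if __name__ == "__main__":
      model = {"T": 0.5, "H": 0.5}          # the single-bit strip (1/2 1/2)
      rng = random.Random(42)
      fair = ["H" if rng.random() < 0.5 else "T" for _ in range(4000)]
      biased = ["H" if rng.random() < 0.8 else "T" for _ in range(4000)]
      for name, data in (("fair  ", fair), ("biased", biased)):
          cols = strip_columns(model, data)
          peak = max(cols)
          print(name, "".join(" .:*#"[min(4, c * 4 // peak)] for c in cols))
      # The "fair" strip prints with roughly even shading; the "biased" strip
      # is clearly darker over one rectangle (one outcome) than the other.

The real 1970 plots stacked many such strips (the 1-bit statistics, the 2-bit combinations, and so on) and drew one physical mark per sample instead of ASCII shading, but the judgment is the same: uniformly gray means the model fits.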
For example, online here: http://dictionary.reference.com/search?q=abscissa&db=* "The horizontal or x coordinate on an (x, y) graph; the input of a function against which the output is plotted.")

So what was the actual result of using my new method to test the IBM 1130 FORTRAN (pseudo)random number generator, taking various bit slices out of the 16-bit numbers generated? The low-order bit just alternated back and forth 0-1, so the bottom H/T strip was perfectly uniform, but the second strip was 100% in the middle HT with not a single sample in TT or HH, etc. upward, resulting in a very skinny "monster" like the topmost antenna on a Chicago skyscraper. The bits continued to generate more complicated patterns as I moved from the low-order toward the high-order bit, but if I recall correctly even the highest-order bit still showed ugly non-uniformity. Overall my single-bit-slice 1-to-n-samples-grouped diagrams looked like weird Chess pieces, instead of the large uniformly filled rectangles they would have been if the bits had been truly pseudo-random. So now I knew the FORTRAN library was worthless for the Monte Carlo application I had in mind.

I didn't know about Knuth's AOCP (Art Of Computer Programming), in particular his studies of pseudo-random number generators, so I hacked together something of my own design as a pseudo-random number generator. Basically it scanned the RAM of the computer it was running on, computing some kind of continuous-stream hash function on it. The result when I tested it was nice uniform rectangles overall, each strip uniformly filled from left to right, with lots of vague clumpiness due to the finite sample, but all in all a success, a somewhat-random number generator plenty good enough for my Monte Carlo application.

The Hollerith cards containing my programs, both the CalComp program and my hash-RAM prng, were lost in a warehouse fire in 1972, but the original CalComp plots happened to be in the trunk of my car at the time, so they survived and I still have them in storage; let me know if you're in the area and would like to make a trip with me to my storage locker to see those vintage plots.

This essay copyright 2005 by Robert Elton Maas, all rights reserved.

That was part 1 of the essay. Part 2, if anybody wants it, is my advice on how to learn Probability and Statistics nowadays, *not* by taking a full class at a college/university, unless you really want to, or unless it gets you credits towards satisfying the requirements for a degree.

P.S. What got me inspired to write this essay, especially part 2, is this newsgroup article I was reading: http://www.google.se/groups?selm=11ckdg14bot9c8%40corp.supernews.com whereupon I suspended reading the article to write the essay immediately, in lieu of posting a direct followup to the article.

P.P.S. Later in 1970 I realized that instead of using modular multiples of the Golden Ratio to decide which vertical scanline to use to fill the selected rectangle, I could use a second set of data, squeezed so that the whole 0..1 strip of the second set fit within the single rectangle of the first data. But then I would still need to decide which scanline to use within the nested interval, so I would use a third dataset squeezed to fit within the second set. In principle this could be carried on forever, yielding a Real Number as the intersection point of an infinite sequence of nested closed intervals.
(Nested open intervals, or half-open intervals, don't necessarily have a non-empty intersection, but nested closed intervals are guaranteed to have an intersection point in any complete ordered metric space, such as the real numbers.) Then I realized that reading out the binary expansion of that real number would yield perfect data compression, and that using the bit sequence, instead of the data, to select the nested intervals could be used to reconstruct the original sequence of data, i.e. decompress the file.

In 1970.Oct I was using the PDP-10 computer at the Stanford A.I. lab, with III (Information International Inc.) CRT displays instead of a CalComp plotter, to do some experiments related to that idea, when John McCarthy caught me and offered me a possible job writing a data-compression program, if I would first make a prototype without pay to prove my ability.
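(Aside: that nested-interval scheme is essentially the idea later widely known as arithmetic coding. Below is a minimal Python sketch of it, my own illustration rather than anything reconstructed from the 1970 prototype; it uses exact fractions to dodge floating-point rounding, assumes a fixed per-symbol model, and passes the sequence length to the decoder separately just to keep the example short.)

  # Encode a symbol sequence as a single number by repeatedly squeezing the
  # model's [0,1) interval for each symbol inside the interval chosen so far,
  # then decode by walking the same nested intervals in reverse.
  from fractions import Fraction

  MODEL = {"T": Fraction(1, 2), "H": Fraction(1, 2)}   # per-symbol probabilities

  def subinterval(low, width, symbol):
      """Return the sub-interval of [low, low+width) allotted to `symbol`."""
      start = low
      for sym, p in MODEL.items():
          if sym == symbol:
              return start, width * p
          start += width * p
      raise ValueError(symbol)

  def encode(symbols):
      low, width = Fraction(0), Fraction(1)
      for s in symbols:
          low, width = subinterval(low, width, s)
      return low + width / 2          # any point inside the final interval works

  def decode(number, length):
      low, width = Fraction(0), Fraction(1)
      out = []
      for _ in range(length):
          for sym in MODEL:
              lo, w = subinterval(low, width, sym)
              if lo <= number < lo + w:
                  out.append(sym)
                  low, width = lo, w
                  break
      return "".join(out)

  if __name__ == "__main__":
      data = "HTTHHTHT"
      x = encode(data)
      print(x)                        # one rational number in [0,1)
      print(decode(x, len(data)))     # reconstructs "HTTHHTHT"

Encoding squeezes the model's whole 0..1 strip inside the previously selected rectangle once per symbol, exactly as the P.P.S. describes; decoding uses the stored number to re-select the same nested intervals and thereby reconstruct the original sequence.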