Subj : Essay: Introduction to "Probability and Statistics"
To : comp.programming,sci.math
From : rem642b
Date : Wed Jul 06 2005 12:00 am

Unlike most of mathematics, where we merely compute with numbers, or plan how numbers might be computed, or figure out patterns of numerical computations, or set up abstract structures and analyze their characteristics, all of which requires nothing more than a strongly disciplined mind and a desire to understand complex but totally consistent systems, probability and statistics requires a special mindset, a special way of thinking about nature. (Albert Einstein is famous for not having this mindset. I hope you all know his famous quote regarding quantum mechanics and metaphorical "dice"?)

When studying probability and statistics, we must accept that we don't know ahead of time how experiments are going to turn out, at least not exactly, and even in principle there's no way to know absolutely for sure how an experiment will turn out. For experiments with a finite set of possible outcomes (I won't discuss infinite cases here in this absolute-beginner introduction) we have a list of possible outcomes of the experiment, but we don't know which of the various outcomes will actually occur when we perform the experiment. Of course after the experiment has been performed and we've observed the result, presumably at that time we know precisely which of the various outcomes really did occur. But before the experiment was performed, all we could know for sure was that each of the outcomes on our list was possible, i.e. it was impossible to cross any of them off the list as impossible. Anything impossible was never on the list in the first place, or we realized our mistake and crossed it off long before doing the experiment. (It's possible that something is truly impossible in nature but we don't *know* it's impossible, so we leave it on our list anyway. I might have time to discuss this topic later.)

Before the experiment, besides the list of what's possible as an outcome, we think we have an idea about something else, namely how often each of the different outcomes would occur if we repeated the experiment many times. For example, if we flip a coin once, it'll come up heads, or it'll come up tails, one or the other, not half and half or any other middle ground between the two extremes. But if we flip a coin many times, or flip many coins once each, etc., we rather expect about half of the outcomes will be heads and about half tails. (In truth, even with an exactly balanced coin, this estimate is slightly wrong. A better prediction would include a third outcome, that something might prevent either heads or tails from coming up. For example, a bird might fly by and gulp the coin while it's still in the air. Or the coin might fall in a slit in a grate over a storm drain and be lost from our sight. Or the coin might land on its edge and roll down the sidewalk far away where we can't find it. Or the coin might land on its edge and just stay there indefinitely, and we declare neither heads nor tails rather than wait hours or days for a breeze to finally topple it one way or the other.
But with a properly balanced coin, without any storm drain openings nearby, on level ground or inside (but not over a couch/sofa where we all know coins tend to fall in the crack and get lost for years, and not over a sink or toilet or the cat's litter box etc.), away from any large congregation of birds or other animals that might snatch the coin, and over a hard non-sticky surface where coins tend to fall over rather than stick into wet mud etc., and away from any number of other problem situations that I haven't thought to include here, the half-heads half-tails estimate is extremely close to a correct prediction of future experiments.)

The study of how best to make such a prediction about the outcome before a single experiment, and/or estimating the long-term average of a series of identical experiments, and/or performing mathematical calculations to compute an estimate for a combination of experiments considered as a single outcome (*1*), is what is known as "Probability", and each of the predictions as to the long-term repeat frequency of a single outcome is called the "probability of that outcome". The analysis of long-term averages after many experiments have already been performed and the data collected is called "Statistics", and the dataset consisting of summary totals or fractions of any given series of experiments is called the "statistics of that series of experiments".

*1* (For example, if you flip two identical coins, under the assumption that the probability for each coin by itself is heads 1/2, tails 1/2, with nothing causing the outcome of either coin to affect the other coin's outcome, what are the possible combination outcomes, and how often do you expect them to occur if you perform the flip-two-coins experiment many times? Answer: two-heads 1/4, head+tail 1/2, two-tails 1/4.)

How are Probability and Statistics related? A priori estimates of probabilities can be used to predict the statistics that will later be generated from a series of experiments. If the statistics differ greatly from the a priori prediction, that tends to indicate that the a priori probabilities were incorrect, and that we should modify our a priori model to do a better job of predicting the statistics of future experiments. There are powerful statistical tests that are supposed to give us good advice as to whether to immediately try to modify our old theory, or whether the theory is "close enough" to correct and we should keep it as-is until and unless there's better evidence that something is wrong with it. The most famous of these statistical tests of how well the statistics from a series of experiments match ("fit" in Statistics jargon) the a priori model is called "Chi Squared". But in 1969 or 1970 I was very uncomfortable with the task of looking at the "Chi Squared" number and trying to figure out whether it was good or bad or whatnot, so I worked out an alternate statistical test that I find much easier to understand and interpret, at least for experiments where the outcomes of a series of experiments can be added together and then divided by the total number of individual experiments to produce a single overall average, which is an agglomerate outcome of the whole series of experiments. (This method is applicable whenever there are only two outcomes, such as coin flipping, or computer pseudo-random bit generators, because you can always treat one outcome as 0 and the other as 1 and then compute the total and average for any collection of experiments.)
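(Aside: here is a minimal simulation sketch of the two-coin example and the Chi Squared comparison, written in modern Python purely for illustration; it is not the original FORTRAN, and the function names and the use of Python's built-in random module are just my choices for the demo.)

  # Flip two fair coins many times, tally the combined outcomes, and compare
  # the observed counts to the predicted fractions 1/4, 1/2, 1/4.  A plain
  # Chi Squared statistic is computed by hand; it stays small when the model fits.
  import random

  def two_coin_experiment(n_trials=100000, seed=1):
      rng = random.Random(seed)
      counts = {"two-heads": 0, "head+tail": 0, "two-tails": 0}
      for _ in range(n_trials):
          heads = rng.randint(0, 1) + rng.randint(0, 1)  # number of heads this trial
          if heads == 2:
              counts["two-heads"] += 1
          elif heads == 1:
              counts["head+tail"] += 1
          else:
              counts["two-tails"] += 1
      expected = {"two-heads": n_trials / 4,
                  "head+tail": n_trials / 2,
                  "two-tails": n_trials / 4}
      chi_squared = sum((counts[k] - expected[k]) ** 2 / expected[k] for k in counts)
      return counts, chi_squared

  if __name__ == "__main__":
      counts, chi2 = two_coin_experiment()
      print(counts)   # roughly 25000 / 50000 / 25000
      print(chi2)     # typically a few units at most when the 1/4, 1/2, 1/4 model fits

Running it with a biased or correlated pair of coins instead makes the observed counts drift away from 1/4, 1/2, 1/4 and the Chi Squared number grow, which is exactly the "statistics contradict the a priori model" situation described above.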
My new (*2*) test of statistical fit to model is based on "normalizing" the statistics so that the local density (*3*) of the probability-distribution function is constant, so if it were graphed it would be a horizontal line if the model was correct (agreed with reality/nature) and some other non-straight graph if the model was wrong. But then I chose not to try drawing a line graph, which is very difficult to get correct (it means computing the derivative of a function that's discrete in the first place, so the derivative makes no sense really), and is a one-time thing on the CalComp 565 plotter on the computer I was using: there's no way to process some data, draw one preliminary line, then include more data, erase the old line, and draw a new line, like you can on a CRT or LED display; either you spend a lot of time drawing many many successive graphs down the page, wasting lots of paper, or you wait until the very end before you know whether the program worked at all, but then you don't know how long to run the program.

Instead I used the CalComp 565 plotter to fill an interval with marks, one for each event, so the visual density of these marks at any abscissa (*4*) matches the local density of the statistics: if the image looks uniformly gray then the fit is good, whereas if it looks lumpy then the fit is bad. This was really nice for the CalComp plotter because I could have the program immediately start drawing, so I'd see one mark, then a second added to it, then a third, etc., seeing the image gradually build up, so I could tell immediately whether the program was working, and in the case of grossly lopsided output from the FORTRAN library I could very quickly see that parts were being filled while parts were being avoided, and after it had run longer I could see a clear image of which parts were dark and which were blank paper.

Actually I plotted several different statistical tests, each independently normalized, stacked on top of each other. The X axis was the abscissa, while the Y axis selected the various tests in strips, and the density was the true ordinate (the value being plotted; see that dictionary entry again). The tests were in a sequence, with the single-bit statistics (1/2 1/2) plotted across the bottom strip, the two-bit combinations (1/4 1/2 1/4) as described earlier in the next-to-bottom strip, etc., upward to the top. I took time out from composing this essay to compose this: http://www.rawbw.com/~rem/AsciiArt/StatisticalTest.txt an AsciiArt diagram of the various strips and the rectangles within each strip.

Anyway, the question arises: within each rectangle, where do I have the CalComp plotter draw the single vertical mark for each sample? I used successive integer multiples of the Golden Ratio, modulo 1, to determine the position (left=0, right=1) within each rectangle to draw the mark.

*2* (Well, it was new in 1970, when I used it to establish firmly that the IBM 1130 FORTRAN random-number generator was utterly horrible as a pseudo-random-bit generator.)

*3* (This gets into the relation between Calculus and statistics of infinite sample spaces over intervals, but I said I wasn't going to talk about infinite sample spaces, so I'm not going to explain precisely what I mean in this essay; if you don't understand the idea even vaguely then please ignore my remarks about my new visual test of statistics/model fit.)

*4* (See any good online dictionary or Analytic Geometry book.
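(Aside: the sketch below is a reconstruction in modern Python, not the 1970 FORTRAN, of how I read the two ingredients just described: it assumes each possible outcome owns a rectangle whose width equals its model probability, so a correct model fills the whole strip at uniform density, and it uses successive multiples of the Golden Ratio modulo 1 to spread the marks for each outcome across its own rectangle. The function names, the ASCII shading, and the 64-column resolution are all just for the demo.)

  # One "strip" of the visual goodness-of-fit test: each possible outcome owns
  # a rectangle whose width equals its model probability, so if the model is
  # right the strip fills with marks at uniform density ("uniformly gray");
  # a wrong model shows up as lumpy dark and light regions.
  import math
  import random

  PHI = (1 + math.sqrt(5)) / 2      # Golden Ratio, used modulo 1 to place marks

  def strip_columns(model, samples, n_columns=64):
      """model: dict outcome -> probability; samples: iterable of outcomes.
      Returns a per-column count of marks across the [0,1) strip."""
      left, rects = 0.0, {}
      for outcome, p in model.items():      # lay out one rectangle per outcome
          rects[outcome] = (left, p)
          left += p
      columns = [0] * n_columns
      seen = {outcome: 0 for outcome in model}
      for outcome in samples:
          seen[outcome] += 1
          lo, width = rects[outcome]
          # Successive multiples of the Golden Ratio, modulo 1, spread the
          # marks for this outcome evenly across its own rectangle.
          x = lo + width * ((seen[outcome] * PHI) % 1.0)
          columns[min(int(x * n_columns), n_columns - 1)] += 1
      return columns

  if __name__ == "__main__":
      model = {"T": 0.5, "H": 0.5}          # the single-bit strip (1/2 1/2)
      rng = random.Random(42)
      fair = ["H" if rng.random() < 0.5 else "T" for _ in range(4000)]
      biased = ["H" if rng.random() < 0.8 else "T" for _ in range(4000)]
      for name, data in (("fair  ", fair), ("biased", biased)):
          cols = strip_columns(model, data)
          peak = max(cols)
          print(name, "".join(" .:*#"[min(4, c * 4 // peak)] for c in cols))
      # The "fair" strip prints with roughly even shading; the "biased" strip
      # is clearly darker over one rectangle (one outcome) than the other.

The real 1970 plots stacked many such strips (the 1-bit statistics, the 2-bit combinations, and so on) and drew one physical mark per sample instead of ASCII shading, but the judgment is the same: uniformly gray means the model fits.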
For example, online here: http://dictionary.reference.com/search?q=abscissa&db=* "The horizontal or x coordinate on an (x, y) graph; the input of a function against which the output is plotted.")

So what was the actual result of using my new method to test the IBM 1130 FORTRAN (pseudo)random number generator, taking various bit slices out of the 16-bit numbers generated? The low-order bit just alternated back and forth 0-1, so the bottom H/T strip was perfectly uniform, but the second strip was 100% in the middle HT with not a single sample in TT or HH, etc. upward, resulting in a very skinny "monster" like the topmost antenna on a Chicago skyscraper. The bits continued to generate more complicated patterns as I moved from the low-order toward the high-order bit, but if I recall correctly even the highest-order bit still showed ugly non-uniformity. Overall my single-bit-slice 1-to-n-samples-grouped diagrams looked like weird Chess pieces, instead of the large uniformly filled rectangles they would have been if the bits had been truly pseudo-random. So now I knew the FORTRAN library was worthless for the Monte Carlo application I had in mind.

I didn't know about Knuth's AOCP (Art Of Computer Programming), in particular his studies of pseudo-random number generators, so I hacked together something of my own design as a pseudo-random number generator. Basically it scanned the RAM of the computer it was running on, computing some kind of continuous-stream hash function on it. The result when I tested it was nice uniform rectangles overall, each strip uniformly filled from left to right, with lots of vague clumpiness due to the finite sample, but all in all a success, a somewhat-random number generator plenty good enough for my Monte Carlo application.

The Hollerith cards containing my programs, both the CalComp program and my hash-RAM prng, were lost in a warehouse fire in 1972, but the original CalComp plots happened to be in the trunk of my car at the time, so they survived and I still have them in storage; let me know if you're in the area and would like to make a trip with me to my storage locker to see those vintage plots.

This essay copyright 2005 by Robert Elton Maas, all rights reserved.

That was part 1 of the essay. Part 2, if anybody wants it, is my advice on how to learn Probability and Statistics nowadays, *not* by taking a full class at a college/university, unless you really want to, or unless it gets you credits towards satisfying the requirements for a degree.

P.S. What got me inspired to write this essay, especially part 2, is this newsgroup article I was reading: http://www.google.se/groups?selm=11ckdg14bot9c8%40corp.supernews.com whereupon I suspended reading the article to write the essay immediately, in lieu of posting a direct followup to the article.

P.P.S. Later in 1970 I realized that instead of using modular multiples of the Golden Ratio to decide which vertical scanline to use to fill the selected rectangle, I could use a second set of data, squeezed so that the whole 0..1 strip of the second set fit within the single rectangle of the first data. But then I would still need to decide which scanline to use within the nested interval, so I would use a third dataset squeezed to fit within the second set. In principle this could be carried on forever, yielding a Real Number as the intersection point of an infinite sequence of nested closed intervals.
(Nested open intervals, or half-open intervals, don't necessarily have a non-empty intersection, but nested closed intervals are guaranteed to have an intersection point in any complete ordered metric space, such as the real numbers.) Then I realized that reading out the binary expansion of that real number would yield perfect data compression, and that using the bit sequence, instead of the data, to select the nested intervals could be used to reconstruct the original sequence of data, i.e. decompress the file.

In 1970.Oct I was using the PDP-10 computer at the Stanford A.I. lab, with III (Information International Inc.) CRT displays instead of a CalComp plotter, to do some experiments related to that idea, when John McCarthy caught me and offered me a possible job writing a data-compression program, if I would first make a prototype without pay to prove my ability.
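(Aside: that nested-interval scheme is essentially the idea later widely known as arithmetic coding. Below is a minimal Python sketch of it, my own illustration rather than anything reconstructed from the 1970 prototype; it uses exact fractions to dodge floating-point rounding, assumes a fixed per-symbol model, and passes the sequence length to the decoder separately just to keep the example short.)

  # Encode a symbol sequence as a single number by repeatedly squeezing the
  # model's [0,1) interval for each symbol inside the interval chosen so far,
  # then decode by walking the same nested intervals in reverse.
  from fractions import Fraction

  MODEL = {"T": Fraction(1, 2), "H": Fraction(1, 2)}   # per-symbol probabilities

  def subinterval(low, width, symbol):
      """Return the sub-interval of [low, low+width) allotted to `symbol`."""
      start = low
      for sym, p in MODEL.items():
          if sym == symbol:
              return start, width * p
          start += width * p
      raise ValueError(symbol)

  def encode(symbols):
      low, width = Fraction(0), Fraction(1)
      for s in symbols:
          low, width = subinterval(low, width, s)
      return low + width / 2          # any point inside the final interval works

  def decode(number, length):
      low, width = Fraction(0), Fraction(1)
      out = []
      for _ in range(length):
          for sym in MODEL:
              lo, w = subinterval(low, width, sym)
              if lo <= number < lo + w:
                  out.append(sym)
                  low, width = lo, w
                  break
      return "".join(out)

  if __name__ == "__main__":
      data = "HTTHHTHT"
      x = encode(data)
      print(x)                        # one rational number in [0,1)
      print(decode(x, len(data)))     # reconstructs "HTTHHTHT"

Encoding squeezes the model's whole 0..1 strip inside the previously selected rectangle once per symbol, exactly as the P.P.S. describes; decoding uses the stored number to re-select the same nested intervals and thereby reconstruct the original sequence.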