Subj : Re: Naive Bayes Algorithm?
To   : comp.programming
From : Mike
Date : Wed Jul 20 2005 10:17 pm

On 2005-07-20, Richard Heathfield wrote:
> Mike wrote:
>
>> I'm doing this in perl and have a perl module from CPAN.
>> I want to understand how this algorithm works, though.
>> Does someone have a link, an explanation, pseudo-code,
>> etc.? It's been many years since I had any of this math,
>> and though I can remember some of the symbols, I get
>> confused when I try to follow the mathematics.
>
> http://www.paulgraham.com/spam.html is the canonical link, I suppose.
>
> The general idea behind Bayesian filtering is as follows:
>
> Consider the normal probability spectrum, 0 to 1. 0 means "impossible",
> and 1 means "certain". We wish to know the probability that a particular
> email is spam. For brevity's sake, we will assume that emails consist
> entirely of single letters of the alphabet.
>
> We start off with a "marking" phase, in which we identify to our
> filtering program whether we consider an email to be spam. Our first
> email looks like this:
>
>   A B C D E F G H I
>
> We decide it's spam, so our marking goes like this:
>
>        Spam   Ham (i.e. not spam)
>   A      1     0
>   B      1     0
>   C      1     0
>   D      1     0
>   E      1     0
>   F      1     0
>   G      1     0
>   H      1     0
>   I      1     0
>
> Our second email is: A B C D F K L, and we decide it's "ham" - something
> we want to read. Adding each token into our database, we now have:
>
>        Spam   Ham
>   A      1     1
>   B      1     1
>   C      1     1
>   D      1     1
>   E      1     0
>   F      1     1
>   G      1     0
>   H      1     0
>   I      1     0
>   K      0     1
>   L      0     1
>
> The third email is spam again, and this time it looks like this:
>
>   C D F I K L M
>
> We update our database:
>
>        Spam   Ham
>   A      1     1
>   B      1     1
>   C      2     1
>   D      2     1
>   E      1     0
>   F      2     1
>   G      1     0
>   H      1     0
>   I      2     0
>   K      1     1
>   L      1     1
>   M      1     0
>
> After a few more emails, some ham, some spam, we end up with:
>
>        Spam   Ham
>   A      5     1
>   B      2     9
>   C      4     8
>   D      2     7
>   E      3     0
>   F      8     3
>   G     11     2
>   H      8     2
>   I      5     1
>   K      3     7
>   L      4     9
>   M      3     1
>
> (The more emails you use for "training", the better.)
>
> Now we're ready to filter. We receive an email like this:
>
>   A B D G M N O P
>
> We look up each token in the database, and find that A, B, D, G, M all
> have entries. We can't use N, O, P in our filtering, because we've never
> seen them before, but we can use the rest. We build a table of
> probabilities:
>
>   A's spam probability is  5/6  = 0.833. (Seen 6 times, of which 5 were spam.)
>   B's spam probability is  2/11 = 0.181.
>   D's spam probability is  2/9  = 0.222.
>   G's spam probability is 11/13 = 0.846.
>   M's spam probability is  3/4  = 0.750.
>
> (If we have lots and lots of tokens, then we pick the N tokens whose
> probabilities are furthest from 0.5 - we're looking for "very spammy"
> tokens and "very hammy" tokens. But for now we'll use ALL our data, to
> work the example more clearly.)
>
> The ham probabilities are of course 1 - spam_probability:
>
>   A's ham probability is 0.167.
>   B's ham probability is 0.819.
>   D's ham probability is 0.778.
>   G's ham probability is 0.154.
>   M's ham probability is 0.250.
>
> We now calculate m / (d + m), where m is the product of all the spam
> probabilities, and d is the product of all the ham probabilities. So,
> in our example, we have:
>
>   m = 0.833 * 0.181 * 0.222 * 0.846 * 0.750, which is 0.0212377
>
>   d = 0.167 * 0.819 * 0.778 * 0.154 * 0.250, which is 0.00409676
>
>   m / (d + m) = 0.0212377 / (0.0212377 + 0.00409676)
>
> which is 0.838 or so. So the Bayesian probability that the latest email
> is a spam mail is around 84%. This is high, but probably not high enough
> to justify binning the email.
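Just to check that I follow, here is the scheme above as a perl sketch
(completely untested, and the names are mine, not the CPAN module's):

  use strict;
  use warnings;

  # Token counts, as in the tables above: in how many spam and how
  # many ham emails has each token been seen?
  my (%spam_count, %ham_count);

  # Marking phase: record one email's tokens as spam or ham.
  sub train {
      my ($is_spam, @tokens) = @_;
      for my $t (@tokens) {
          if ($is_spam) { $spam_count{$t}++ } else { $ham_count{$t}++ }
      }
  }

  # Filtering phase: combine the per-token spam probabilities
  # as m / (d + m).
  sub score {
      my (@tokens) = @_;
      my ($m, $d) = (1, 1);
      for my $t (@tokens) {
          my $s = $spam_count{$t} || 0;
          my $h = $ham_count{$t}  || 0;
          next if $s + $h == 0;        # never seen: skip, as with N, O, P
          my $p = $s / ($s + $h);      # the token's spam probability
          $m *= $p;                    # product of spam probabilities
          $d *= 1 - $p;                # product of ham probabilities
      }
      return $m / ($d + $m);
  }

  train(1, qw(A B C D E F G H I));     # first email, marked spam
  train(0, qw(A B C D F K L));         # second email, marked ham
  train(1, qw(C D F I K L M));         # third email, marked spam
  # ... more training mail needed to reach the final table above ...
  printf "spam probability: %.3f\n", score(qw(A B D G M N O P));

(I'd guess a real filter would also clamp $p away from exactly 0 and 1,
so that a single token can't force either product to zero.)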
> Typical Bayesian filters are set to reject stuff only if it is 99% (or
> even 99.9%) likely to be spam, on the grounds that it's better to wade
> through a little spam than to miss a potentially important email.
>
> HTH. HAND.

What an excellent post, thank you. I understand better now.

The perl module I have from CPAN understands both numerics and
symbolics. The example above is all symbolic. How could it be changed
for numerics, including reals, or are numerics and reals just a
different kind of symbolic?

Mike
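P.S. Half-answering my own question after some thought: I suspect reals
can indeed be treated as "just a different kind of symbolic", by binning
each value into a named range and counting the range like any other
token, so the machinery above is unchanged. Again untested, and only my
guess at what the module does internally:

  # Turn a real value into a discrete token by binning it. The bin
  # width is arbitrary; it controls how coarsely values are lumped.
  sub numeric_token {
      my ($name, $value, $width) = @_;
      my $bin = int($value / $width);
      return "$name:$bin";             # e.g. "price:3" for 3.50, width 1
  }

  train(1, 'viagra', numeric_token('price', 3.50, 1));

The other approach I have seen described is to keep numerics separate
and model each one with a per-class mean and variance (a normal
distribution) instead of counts, but the binning trick stays purely
symbolic.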