Subj : Re: Naive Bayes Algorithm?
To   : comp.programming
From : Richard Heathfield
Date : Wed Jul 20 2005 04:01 am

Mike wrote:

> I'm doing this in perl and have a perl module from CPAN.
> I want to understand how this algorithm works, though.
> Does someone have a link, an explanation, pseudo-code,
> etc.? It's been many years since I had any of this math,
> and though I can remember some of the symbols, I get
> confused when I try to follow the mathematics.

http://www.paulgraham.com/spam.html is the canonical link, I suppose.

The general idea behind Bayesian filtering is as follows:

Consider the normal probability spectrum, 0 to 1. 0 means "impossible",
and 1 means "certain". We wish to know the probability that a particular
email is spam. For brevity's sake, we will assume that emails consist
entirely of single letters of the alphabet.

We start off with a "marking" phase, in which we tell our filtering
program whether we consider each email to be spam. Our first email looks
like this:

A B C D E F G H I

We decide it's spam, so our marking goes like this:

     Spam   Ham (i.e. not spam)
A     1      0
B     1      0
C     1      0
D     1      0
E     1      0
F     1      0
G     1      0
H     1      0
I     1      0

Our second email is: A B C D F K L, and we decide it's "ham" - something
we want to read. Adding each token into our database, we now have:

     Spam   Ham
A     1      1
B     1      1
C     1      1
D     1      1
E     1      0
F     1      1
G     1      0
H     1      0
I     1      0
K     0      1
L     0      1

The third email is spam again, and this time it looks like this:

C D F I K L M

We update our database:

     Spam   Ham
A     1      1
B     1      1
C     2      1
D     2      1
E     1      0
F     2      1
G     1      0
H     1      0
I     2      0
K     1      1
L     1      1
M     1      0

After a few more emails, some ham, some spam, we end up with:

     Spam   Ham
A     5      1
B     2      9
C     4      8
D     2      7
E     3      0
F     8      3
G    11      2
H     8      2
I     5      1
K     3      7
L     4      9
M     3      1

(The more emails you use for "training", the better.)

Now we're ready to filter. We receive an email like this:

A B D G M N O P

We look up each token in the database, and find that A, B, D, G, and M
all have entries. We can't use N, O, or P in our filtering, because
we've never seen them before, but we can use the rest. We build a table
of probabilities; each token's spam probability is the number of spam
emails it appeared in, divided by the total number of emails it
appeared in:

A's spam probability is  5/6  = 0.833. (Seen 6 times, of which 5 were spam)
B's spam probability is  2/11 = 0.181.
D's spam probability is  2/9  = 0.222.
G's spam probability is 11/13 = 0.846.
M's spam probability is  3/4  = 0.750.

(If we have lots and lots of tokens, then we pick the handful of tokens
whose probabilities are furthest from 0.5 - we're looking for "very
spammy" tokens and "very hammy" tokens. But for now we'll use ALL our
data, to work the example more clearly.)

The ham probabilities are of course 1 - spam_probability:

A's ham probability is 0.167.
B's ham probability is 0.819.
D's ham probability is 0.778.
G's ham probability is 0.154.
M's ham probability is 0.250.

We now calculate m / (d + m), where m is the product of all the spam
probabilities, and d is the product of all the ham probabilities. So,
in our example, we have:

m = 0.833 * 0.181 * 0.222 * 0.846 * 0.750, which is 0.0212377
d = 0.167 * 0.819 * 0.778 * 0.154 * 0.250, which is 0.00409676

m / (d + m) is 0.0212377 / (0.0212377 + 0.00409676), which is 0.838
or so.

So the Bayesian probability that the latest email is spam is around
84%. This is high, but probably not high enough to justify binning the
email. Typical Bayesian filters are set to reject a message only if it
is 99% (or even 99.9%) likely to be spam, on the grounds that it's
better to wade through a little spam than to miss a potentially
important email.

HTH. HAND.
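PS - since you're working in Perl anyway, here's a rough sketch of the
whole thing, just to make the bookkeeping concrete. The subroutine and
hash names are mine (this isn't the interface of your CPAN module, or
of any module), and since I haven't listed the later training emails,
the sketch seeds the counts straight from the final table:

#!/usr/bin/perl
use strict;
use warnings;

my %spam;   # token => number of spam emails containing it
my %ham;    # token => number of ham emails containing it

# The "marking" phase: count each token under spam or ham.
sub train {
    my ($is_spam, @tokens) = @_;
    for my $t (@tokens) {
        if ($is_spam) { $spam{$t}++ } else { $ham{$t}++ }
    }
}

train(1, qw(A B C D E F G H I));   # first email, marked spam
train(0, qw(A B C D F K L));       # second email, marked ham
train(1, qw(C D F I K L M));       # third email, marked spam

# The emails that got us from here to the final table aren't
# listed above, so jump straight to its totals:
%spam = (A => 5, B => 2, C => 4, D => 2, E => 3, F => 8,
         G => 11, H => 8, I => 5, K => 3, L => 4, M => 3);
%ham  = (A => 1, B => 9, C => 8, D => 7, E => 0, F => 3,
         G => 2, H => 2, I => 1, K => 7, L => 9, M => 1);

# The filtering phase: combine the per-token spam probabilities
# into m / (d + m), ignoring tokens we have never seen.
sub spam_probability {
    my ($m, $d) = (1, 1);
    for my $t (@_) {
        my $s = $spam{$t} || 0;
        my $h = $ham{$t}  || 0;
        next if $s + $h == 0;     # never seen this token: skip it
        my $p = $s / ($s + $h);   # token's spam probability
        $m *= $p;                 # product of spam probabilities
        $d *= 1 - $p;             # product of ham probabilities
    }
    return $m / ($m + $d);
}

printf "%.3f\n", spam_probability(qw(A B D G M N O P));

This prints 0.840 rather than 0.838, because it keeps the exact
fractions instead of rounding each probability to three places before
multiplying - same calculation, less rounding error.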
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
mail: rjh at above domain