Subj : Re: Naive Bayes Algorithm?
To   : comp.programming
From : Mike
Date : Wed Jul 20 2005 10:17 pm

On 2005-07-20, Richard Heathfield wrote:
> Mike wrote:
>
>> I'm doing this in perl and have a perl module from CPAN.
>> I want to understand how this algorithm works, though.
>> Does someone have a link, an explanation, pseudo-code,
>> etc.? It's been many years since I had any of this math,
>> and though I can remember some of the symbols, I get
>> confused when I try to follow the mathematics.
>
> http://www.paulgraham.com/spam.html is the canonical link, I suppose.
>
> The general idea behind Bayesian filtering is as follows:
>
> Consider the normal probability spectrum, 0 to 1. 0 means "impossible",
> and 1 means "certain". We wish to know the probability that a particular
> email is spam. For brevity's sake, we will assume that emails consist
> entirely of single letters of the alphabet.
>
> We start off with a "marking" phase, in which we identify to our
> filtering program whether we consider an email to be spam. Our first
> email looks like this:
>
>   A B C D E F G H I
>
> We decide it's spam, so our marking goes like this:
>
>        Spam   Ham (i.e. not spam)
>   A      1     0
>   B      1     0
>   C      1     0
>   D      1     0
>   E      1     0
>   F      1     0
>   G      1     0
>   H      1     0
>   I      1     0
>
> Our second email is: A B C D F K L, and we decide it's "ham" - something
> we want to read. Adding each token into our database, we now have:
>
>        Spam   Ham
>   A      1     1
>   B      1     1
>   C      1     1
>   D      1     1
>   E      1     0
>   F      1     1
>   G      1     0
>   H      1     0
>   I      1     0
>   K      0     1
>   L      0     1
>
> The third email is spam again, and this time it looks like this:
>
>   C D F I K L M
>
> We update our database:
>
>        Spam   Ham
>   A      1     1
>   B      1     1
>   C      2     1
>   D      2     1
>   E      1     0
>   F      2     1
>   G      1     0
>   H      1     0
>   I      2     0
>   K      1     1
>   L      1     1
>   M      1     0
>
> After a few more emails, some ham, some spam, we end up with:
>
>        Spam   Ham
>   A      5     1
>   B      2     9
>   C      4     8
>   D      2     7
>   E      3     0
>   F      8     3
>   G     11     2
>   H      8     2
>   I      5     1
>   K      3     7
>   L      4     9
>   M      3     1
>
> (The more emails you use for "training", the better.)
>
> Now we're ready to filter. We receive an email like this:
>
>   A B D G M N O P
>
> We look up each token in the database, and find that A, B, D, G, M all
> have entries. We can't use N, O, P in our filtering, because we've never
> seen them before, but we can use the rest. We build a table of
> probabilities:
>
>   A's spam probability is  5/6  = 0.833. (Seen 6 times, of which 5 were spam.)
>   B's spam probability is  2/11 = 0.181.
>   D's spam probability is  2/9  = 0.222.
>   G's spam probability is 11/13 = 0.846.
>   M's spam probability is  3/4  = 0.750.
>
> (If we have lots and lots of tokens, then we pick the N tokens whose
> probabilities are furthest from 0.5 - we're looking for "very spammy"
> tokens and "very hammy" tokens. But for now we'll use ALL our data, to
> work the example more clearly.)
>
> The ham probabilities are of course 1 - spam_probability:
>
>   A's ham probability is 0.167.
>   B's ham probability is 0.819.
>   D's ham probability is 0.778.
>   G's ham probability is 0.154.
>   M's ham probability is 0.250.
>
> We now calculate m / (d + m), where m is the product of all the spam
> probabilities, and d is the product of all the ham probabilities. So,
> in our example, we have:
>
>   m = 0.833 * 0.181 * 0.222 * 0.846 * 0.750, which is 0.0212377
>
>   d = 0.167 * 0.819 * 0.778 * 0.154 * 0.250, which is 0.00409676
>
>   m / (d + m) = 0.0212377 / (0.0212377 + 0.00409676)
>
> which is 0.838 or so. So the Bayesian probability that the latest email
> is a spam mail is around 84%. This is high, but probably not high enough
> to justify binning the email.
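Just to check that I follow, here is the scheme above as a perl sketch
(completely untested, and the names are mine, not the CPAN module's):

  use strict;
  use warnings;

  # Token counts, as in the tables above: in how many spam and how
  # many ham emails has each token been seen?
  my (%spam_count, %ham_count);

  # Marking phase: record one email's tokens as spam or ham.
  sub train {
      my ($is_spam, @tokens) = @_;
      for my $t (@tokens) {
          if ($is_spam) { $spam_count{$t}++ } else { $ham_count{$t}++ }
      }
  }

  # Filtering phase: combine the per-token spam probabilities
  # as m / (d + m).
  sub score {
      my (@tokens) = @_;
      my ($m, $d) = (1, 1);
      for my $t (@tokens) {
          my $s = $spam_count{$t} || 0;
          my $h = $ham_count{$t}  || 0;
          next if $s + $h == 0;        # never seen: skip, as with N, O, P
          my $p = $s / ($s + $h);      # the token's spam probability
          $m *= $p;                    # product of spam probabilities
          $d *= 1 - $p;                # product of ham probabilities
      }
      return $m / ($d + $m);
  }

  train(1, qw(A B C D E F G H I));     # first email, marked spam
  train(0, qw(A B C D F K L));         # second email, marked ham
  train(1, qw(C D F I K L M));         # third email, marked spam
  # ... more training mail needed to reach the final table above ...
  printf "spam probability: %.3f\n", score(qw(A B D G M N O P));

(I'd guess a real filter would also clamp $p away from exactly 0 and 1,
so that a single token can't force either product to zero.)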
> Typical Bayesian filters are set to reject stuff only if it is 99% (or
> even 99.9%) likely to be spam, on the grounds that it's better to wade
> through a little spam than to miss a potentially important email.
>
> HTH. HAND.

What an excellent post, thank you. I understand better now.

The perl module I have from CPAN understands both numerics and
symbolics. The example above is all symbolic. How could it be changed
for numerics, including reals, or are numerics and reals just a
different kind of symbolic?

Mike
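P.S. Half-answering my own question after some thought: I suspect reals
can indeed be treated as "just a different kind of symbolic", by binning
each value into a named range and counting the range like any other
token, so the machinery above is unchanged. Again untested, and only my
guess at what the module does internally:

  # Turn a real value into a discrete token by binning it. The bin
  # width is arbitrary; it controls how coarsely values are lumped.
  sub numeric_token {
      my ($name, $value, $width) = @_;
      my $bin = int($value / $width);
      return "$name:$bin";             # e.g. "price:3" for 3.50, width 1
  }

  train(1, 'viagra', numeric_token('price', 3.50, 1));

The other approach I have seen described is to keep numerics separate
and model each one with a per-class mean and variance (a normal
distribution) instead of counts, but the binning trick stays purely
symbolic.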