Subj : Re: Naive Bayes Algorithm?
To   : comp.programming
From : Richard Heathfield
Date : Wed Jul 20 2005 04:01 am

Mike wrote:

> I'm doing this in perl and have a perl module from CPAN.
> I want to understand how this algorithm works, though.
> Does someone have a link, an explanation, pseudo-code,
> etc.? It's been many years since I had any of this math,
> and though I can remember some of the symbols, I get
> confused when I try to follow the mathematics.

http://www.paulgraham.com/spam.html is the canonical link, I suppose.

The general idea behind Bayesian filtering is as follows:

Consider the normal probability spectrum, 0 to 1. 0 means "impossible",
and 1 means "certain". We wish to know the probability that a particular
email is spam. For brevity's sake, we will assume that emails consist
entirely of single letters of the alphabet.

We start off with a "marking" phase, in which we tell our filtering
program whether we consider each email to be spam. Our first email looks
like this:

A B C D E F G H I

We decide it's spam, so our marking goes like this:

     Spam   Ham (i.e. not spam)
A     1      0
B     1      0
C     1      0
D     1      0
E     1      0
F     1      0
G     1      0
H     1      0
I     1      0

Our second email is: A B C D F K L, and we decide it's "ham" - something
we want to read. Adding each token into our database, we now have:

     Spam   Ham
A     1      1
B     1      1
C     1      1
D     1      1
E     1      0
F     1      1
G     1      0
H     1      0
I     1      0
K     0      1
L     0      1

The third email is spam again, and this time it looks like this:

C D F I K L M

We update our database:

     Spam   Ham
A     1      1
B     1      1
C     2      1
D     2      1
E     1      0
F     2      1
G     1      0
H     1      0
I     2      0
K     1      1
L     1      1
M     1      0

After a few more emails, some ham, some spam, we end up with:

     Spam   Ham
A     5      1
B     2      9
C     4      8
D     2      7
E     3      0
F     8      3
G    11      2
H     8      2
I     5      1
K     3      7
L     4      9
M     3      1

(The more emails you use for "training", the better.)

Now we're ready to filter. We receive an email like this:

A B D G M N O P

We look up each token in the database, and find that A, B, D, G, and M
all have entries. We can't use N, O, or P in our filtering, because
we've never seen them before, but we can use the rest. We build a table
of probabilities; each token's spam probability is the number of spam
emails it appeared in, divided by the total number of emails it
appeared in:

A's spam probability is  5/6  = 0.833. (Seen 6 times, of which 5 were spam)
B's spam probability is  2/11 = 0.181.
D's spam probability is  2/9  = 0.222.
G's spam probability is 11/13 = 0.846.
M's spam probability is  3/4  = 0.750.

(If we have lots and lots of tokens, then we pick the handful of tokens
whose probabilities are furthest from 0.5 - we're looking for "very
spammy" tokens and "very hammy" tokens. But for now we'll use ALL our
data, to work the example more clearly.)

The ham probabilities are of course 1 - spam_probability:

A's ham probability is 0.167.
B's ham probability is 0.819.
D's ham probability is 0.778.
G's ham probability is 0.154.
M's ham probability is 0.250.

We now calculate m / (d + m), where m is the product of all the spam
probabilities, and d is the product of all the ham probabilities. So,
in our example, we have:

m = 0.833 * 0.181 * 0.222 * 0.846 * 0.750, which is 0.0212377
d = 0.167 * 0.819 * 0.778 * 0.154 * 0.250, which is 0.00409676

m / (d + m) is 0.0212377 / (0.0212377 + 0.00409676), which is 0.838
or so.

So the Bayesian probability that the latest email is spam is around
84%. This is high, but probably not high enough to justify binning the
email. Typical Bayesian filters are set to reject a message only if it
is 99% (or even 99.9%) likely to be spam, on the grounds that it's
better to wade through a little spam than to miss a potentially
important email.

HTH. HAND.
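PS - since you're working in Perl anyway, here's a rough sketch of the
whole thing, just to make the bookkeeping concrete. The subroutine and
hash names are mine (this isn't the interface of your CPAN module, or
of any module), and since I haven't listed the later training emails,
the sketch seeds the counts straight from the final table:

#!/usr/bin/perl
use strict;
use warnings;

my %spam;   # token => number of spam emails containing it
my %ham;    # token => number of ham emails containing it

# The "marking" phase: count each token under spam or ham.
sub train {
    my ($is_spam, @tokens) = @_;
    for my $t (@tokens) {
        if ($is_spam) { $spam{$t}++ } else { $ham{$t}++ }
    }
}

train(1, qw(A B C D E F G H I));   # first email, marked spam
train(0, qw(A B C D F K L));       # second email, marked ham
train(1, qw(C D F I K L M));       # third email, marked spam

# The emails that got us from here to the final table aren't
# listed above, so jump straight to its totals:
%spam = (A => 5, B => 2, C => 4, D => 2, E => 3, F => 8,
         G => 11, H => 8, I => 5, K => 3, L => 4, M => 3);
%ham  = (A => 1, B => 9, C => 8, D => 7, E => 0, F => 3,
         G => 2, H => 2, I => 1, K => 7, L => 9, M => 1);

# The filtering phase: combine the per-token spam probabilities
# into m / (d + m), ignoring tokens we have never seen.
sub spam_probability {
    my ($m, $d) = (1, 1);
    for my $t (@_) {
        my $s = $spam{$t} || 0;
        my $h = $ham{$t}  || 0;
        next if $s + $h == 0;     # never seen this token: skip it
        my $p = $s / ($s + $h);   # token's spam probability
        $m *= $p;                 # product of spam probabilities
        $d *= 1 - $p;             # product of ham probabilities
    }
    return $m / ($m + $d);
}

printf "%.3f\n", spam_probability(qw(A B D G M N O P));

This prints 0.840 rather than 0.838, because it keeps the exact
fractions instead of rounding each probability to three places before
multiplying - same calculation, less rounding error.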
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
mail: rjh at above domain