Subj : Re: Naive Bayes Algorithm?
To   : comp.programming
From : Mike
Date : Thu Jul 21 2005 11:26 pm

On 2005-07-21, Richard Heathfield <invalid@address.co.uk.invalid> wrote:
> Mike wrote:
>
>> The perl module I have from CPAN understands both numerics and symbolics.
>> The example above is all symbolic. How could it be changed for numerics,
>> including reals, or are numerics and reals just a different kind of
>> symbolic?
>
> Where I used individual letters, you would substitute tokens - words, 
> numbers, whatever you like. How you decide to partition and interpret 
> tokens is up to you.
>
> Personally, I would not bother to distinguish numerics and reals. I would 
> just treat them as character strings. I cannot see any reason to 
> distinguish them.
>
> For maximum effectiveness, I would
>
> (a) rip out and discard any HTML tags, since these are often used by 
> spammers in an attempt to disguise spam; <b></b>B<b></b>u<b></b>y<b></b>! 
> being a very simplistic example of what I mean;
> (b) tokenise fairly aggressively on what remains; I would certainly use 
> normal punctuation in my delimiters, as well as the usual space/tab/newline 
> combination;
> (c) definitely parse the headers, not just the body;
> (d) treat every token as just another token, even if it's numeric in spirit;
> (e) double the ham-counts, just to play safe.
>
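
For reference, steps (a), (b) and (d) above might look roughly like the
sketch below. It is only an illustration: the function name, the tag
pattern and the delimiter set are my own assumptions, and keeping a
decimal point inside a number (so 7.1 stays one token) is my choice, not
something the steps require. Step (e) is not covered.

```python
import re

# Crude HTML tag matcher (assumption: anything between < and > is a tag).
TAG_RE = re.compile(r"<[^>]*>")

# Delimiters: whitespace, common punctuation, and a period that is not
# followed by a digit, so "7.1" survives as a single token.
DELIM_RE = re.compile(r'(?:[\s,;:!?"()\[\]]|\.(?!\d))+')

def tokenize(text):
    """Strip tags, then split on delimiters; every token is a plain string."""
    stripped = TAG_RE.sub("", text)
    return [tok for tok in DELIM_RE.split(stripped) if tok]

print(tokenize("<b></b>B<b></b>u<b></b>y<b></b>! now"))
print(tokenize("reading 7.1, status ok."))
```

Numbers come out as ordinary string tokens, which is what (d) asks for.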

Ah, I see. I'm not using the algorithm for spam/ham filtering, though.
I'm feeding data from work into a Bayes algorithm to see whether Bayes'
theorem can help with analysis at work.

For the numbers, is there a way to distinguish a good event with a value
of 7.1 from a bad event with a value of 7.2? Treating the data only as
strings does greatly simplify things.
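
To make the question concrete: as strings, "7.1" and "7.2" are unrelated
tokens, so a value never seen in training (say 7.14) tells the classifier
nothing. One common alternative, which may or may not be what the CPAN
module does, is to model a numeric attribute per class with a normal
distribution. A minimal sketch, assuming invented readings and
hypothetical names throughout:

```python
import math
from statistics import mean, stdev

# Hypothetical readings, made up purely for illustration.
good = [7.08, 7.10, 7.11, 7.12, 7.09]
bad = [7.19, 7.20, 7.22, 7.21, 7.18]

def log_score(x, values, prior):
    """log(prior) + log of the normal density fitted to `values`, at x."""
    mu, sigma = mean(values), stdev(values)
    return (math.log(prior)
            - math.log(sigma)
            - 0.5 * math.log(2 * math.pi)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def classify(x):
    """Pick the class with the higher log score for a single numeric value."""
    total = len(good) + len(bad)
    score_good = log_score(x, good, len(good) / total)
    score_bad = log_score(x, bad, len(bad) / total)
    return "good" if score_good > score_bad else "bad"

for value in (7.1, 7.2, 7.14):
    print(value, classify(value))
```

Working in logs avoids underflow when a value lies far from both class
means. Whether a normal distribution suits the real data is a separate
question.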

Mike
