Subj : Re: Naive Bayes Algorithm?
To   : comp.programming
From : Mike
Date : Thu Jul 21 2005 11:26 pm

On 2005-07-21, Richard Heathfield wrote:
> Mike wrote:
>
>> The perl module I have from CPAN understands both numerics and symbolics.
>> The example above is all symbolic. How could it be changed for numerics,
>> including reals, or are numerics and reals just a different kind of
>> symbolic?
>
> Where I used individual letters, you would substitute tokens - words,
> numbers, whatever you like. How you decide to partition and interpret
> tokens is up to you.
>
> Personally, I would not bother to distinguish numerics and reals. I would
> just treat them as character strings. I cannot see any reason to
> distinguish them.
>
> For maximum effectiveness, I would
>
> (a) rip out and discard any HTML tags, since these are often used by
> spammers in an attempt to disguise spam; Buy!
> being a very simplistic example of what I mean;
> (b) tokenise fairly aggressively on what remains; I would certainly use
> normal punctuation in my delimiters, as well as the usual space/tab/newline
> combination;
> (c) definitely parse the headers, not just the body;
> (d) treat every token as just another token, even if it's numeric in spirit;
> (e) double the ham-counts, just to play safe.
>

Ah, I see. I'm not using, or wanting to use, the algorithm for studying
spam/ham. My goal is to feed data from work into a Bayes algorithm and see
whether Bayes' theorem can help with analysis at work.

For the numbers, is there a way to distinguish a good event with a value
of 7.1 from a bad event with a value of 7.2?

Treating the data as only strings does greatly simplify things.

Mike
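
One common way to handle the 7.1-versus-7.2 question is to stop treating the
number as a token and instead model it with a per-class normal distribution
(often called Gaussian naive Bayes). The sketch below is only an illustration
of that idea in Python, not what the CPAN module does: the function names are
hypothetical, the readings are made up, the single-feature setup is a
simplification, and the assumption that values within each class are roughly
normally distributed may not hold for real work data.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples):
    """samples: list of (label, value) pairs.
    Returns {label: (prior, mean, variance)}."""
    by_label = defaultdict(list)
    for label, value in samples:
        by_label[label].append(value)
    total = len(samples)
    model = {}
    for label, values in by_label.items():
        mean = sum(values) / len(values)
        # Population variance plus a small floor (an arbitrary choice) so a
        # class whose values are all identical does not get zero variance.
        var = sum((v - mean) ** 2 for v in values) / len(values) + 1e-4
        model[label] = (len(values) / total, mean, var)
    return model

def log_posterior(model, value):
    """Unnormalised log P(label | value) for every label."""
    scores = {}
    for label, (prior, mean, var) in model.items():
        log_likelihood = (-0.5 * math.log(2 * math.pi * var)
                          - (value - mean) ** 2 / (2 * var))
        scores[label] = math.log(prior) + log_likelihood
    return scores

def classify(model, value):
    scores = log_posterior(model, value)
    return max(scores, key=scores.get)

# Made-up readings, for illustration only.
training = [("good", 7.0), ("good", 7.1), ("good", 7.1),
            ("bad", 7.2), ("bad", 7.3), ("bad", 7.2)]
model = fit_gaussian_nb(training)

print(classify(model, 7.05))  # good
print(classify(model, 7.25))  # bad
```

Unlike the string-token approach, where "7.1" and "7.2" are unrelated symbols
and an unseen "7.15" matches neither, this version can score values it has
never seen. Whether it separates 7.1 from 7.2 usefully depends entirely on how
tightly each class clusters in the real data.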