Subj : Re: Naive Bayes Algorithm?
To   : comp.programming
From : Richard Heathfield
Date : Thu Jul 21 2005 07:20 am

Mike wrote:

> The perl module I have from CPAN understands both numerics and symbolics.
> The example above is all symbolic. How could it be changed for numerics,
> including reals, or are numerics and reals just a different kind of
> symbolic?

Where I used individual letters, you would substitute tokens - words,
numbers, whatever you like. How you decide to partition and interpret
tokens is up to you.

Personally, I would not bother to distinguish numerics and reals. I would
just treat them as character strings. I cannot see any reason to
distinguish them.

For maximum effectiveness, I would

(a) rip out and discard any HTML tags, since these are often used by
    spammers in an attempt to disguise spam; Buy! being a very simplistic
    example of what I mean;

(b) tokenise fairly aggressively on what remains; I would certainly use
    normal punctuation in my delimiters, as well as the usual
    space/tab/newline combination;

(c) definitely parse the headers, not just the body;

(d) treat every token as just another token, even if it's numeric in
    spirit;

(e) double the ham-counts, just to play safe.

-- 
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
mail: rjh at above domain
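
Points (a) to (e) above could be sketched roughly as follows in Python. This is only an illustration under several assumptions, not the method from the post or from the CPAN module: the function names are hypothetical, the tag stripper is a crude regex rather than a real HTML parser, the choice to keep a dot between digits so that a real stays one token is mine, and "double the ham-counts" is read in the style of Paul Graham's "A Plan for Spam" (ham occurrences weighted twice when estimating a token's spam probability), which may not be exactly what was intended.

```python
import re
from collections import Counter

# Crude tag matcher (assumption: good enough for a sketch, not a real HTML parser).
TAG_RE = re.compile(r"<[^>]*>")

# Delimiters: whitespace and ordinary punctuation. A dot is a delimiter
# unless a digit follows it, so "3.14" survives as one string token.
DELIM_RE = re.compile(r"[\s,;:!?\"()\[\]<>]+|\.(?!\d)")


def strip_tags(text):
    """(a) Delete anything that looks like an HTML tag, joining what surrounds it."""
    return TAG_RE.sub("", text)


def tokenise(message):
    """(b), (c), (d): split the raw message, headers included; numbers stay strings."""
    return [tok for tok in DELIM_RE.split(strip_tags(message)) if tok]


def train(messages):
    """Count token occurrences across a list of raw messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenise(message))
    return counts


def spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """(e) Per-token spam estimate with the ham count doubled (Graham-style assumption)."""
    bad = spam_counts[token]
    good = 2 * ham_counts[token]          # the doubling biases against false positives
    if bad + good == 0:
        return 0.5                        # unseen token: assumed neutral
    b = min(1.0, bad / max(1, n_spam))
    g = min(1.0, good / max(1, n_ham))
    return max(0.01, min(0.99, b / (b + g)))
```

For instance, with the invented input `"Subject: B<b></b>uy 3.14 pills now!"`, `tokenise` removes the empty tag pair so that "Buy" is reassembled, and returns `['Subject', 'Buy', '3.14', 'pills', 'now']`. The header word and the real number are treated exactly like any other token.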