Subj : Re: Naive Bayes Algorithm?
To : comp.programming
From : Richard Heathfield
Date : Thu Jul 21 2005 07:20 am
Mike wrote:
> The perl module I have from CPAN understands both numerics and symbolics.
> The example above is all symbolic. How could it be changed for numerics,
> including reals, or are numerics and reals just a different kind of
> symbolic?
Where I used individual letters, you would substitute tokens - words,
numbers, whatever you like. How you decide to partition and interpret
tokens is up to you.
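Personally, I would not bother to distinguish numerics and reals from any
other token. I would just treat them as character strings. In the scheme I
described, only the number of times each token turns up matters, so the
value a token happens to represent never enters into it.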
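A minimal sketch of that idea, in Python rather than Perl and with made-up training data. The names count_tokens, spam_counts and ham_counts are mine and have nothing to do with the CPAN module. Numeric-looking tokens go into the same table as words, keyed by their spelling:

```python
from collections import Counter

def count_tokens(messages):
    """Count how often each token (a plain string) occurs across messages."""
    counts = Counter()
    for message in messages:
        # Whitespace split only, so "9.99" stays a single token here.
        counts.update(message.split())
    return counts

# Hypothetical training data, for illustration only.
spam_counts = count_tokens(["win 1000000 now", "only 9.99 today"])
ham_counts = count_tokens(["meeting at 10.30", "invoice 9.99 attached"])

# "9.99" is just a key like any other; no numeric parsing happens.
print(spam_counts["9.99"], ham_counts["9.99"])
```

Nothing here ever converts "9.99" to a number, which is the whole point.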
For maximum effectiveness, I would
(a) rip out and discard any HTML tags, since these are often used by
spammers in an attempt to disguise spam; "Buy!" with tags wedged between
its letters being a very simplistic example of what I mean;
(b) tokenise fairly aggressively on what remains; I would certainly use
normal punctuation in my delimiters, as well as the usual space/tab/newline
combination;
(c) definitely parse the headers, not just the body;
(d) treat every token as just another token, even if it's numeric in spirit;
(e) double the ham-counts, just to play safe.
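Steps (a), (b), (d) and (e) can be sketched roughly as follows. This is Python rather than Perl, the regular expressions are deliberately crude, and the names strip_tags, tokenise and spam_probability are my own illustration. The per-token formula is a simple ratio of relative frequencies, which is an assumption about the scoring in use, not a description of any particular module. Step (c) amounts to feeding the header text through the same tokeniser.

```python
import re
import string

# Crude tag matcher: anything between < and >. A real HTML parser is safer.
TAG_RE = re.compile(r"<[^>]*>")
# Delimiters: runs of ASCII punctuation and whitespace.
DELIMS_RE = re.compile("[" + re.escape(string.punctuation) + r"\s]+")

def strip_tags(text):
    """(a) Remove HTML tags so a word split up by tags is rejoined."""
    return TAG_RE.sub("", text)

def tokenise(text):
    """(b), (d) Split on punctuation and whitespace; every token is a string."""
    return [t for t in DELIMS_RE.split(strip_tags(text)) if t]

def spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """(e) Per-token spam estimate with the ham count doubled."""
    s = spam_counts.get(token, 0) / max(n_spam, 1)
    h = 2 * ham_counts.get(token, 0) / max(n_ham, 1)
    if s + h == 0:
        return 0.5  # unseen token: no evidence either way (an assumption)
    return s / (s + h)

print(tokenise("<b>B</b><i>u</i>y! 9.99, now"))
```

Note that tokenising this aggressively splits "9.99" into "9" and "99". That is harmless as long as training and classification use the same tokeniser. Doubling the ham count pulls every estimate towards "not spam", which is what playing safe means here.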
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
mail: rjh at above domain