       bmf.1 - bmf - bmf (Bayesian Mail Filter) 0.9.4 fork + patches
       bmf.1 (4854B)
       ---
            1 .\"Generated by db2man.xsl. Don't modify this, modify the source.
            2 .de Sh \" Subsection
            3 .br
            4 .if t .Sp
            5 .ne 5
            6 .PP
            7 \fB\\$1\fR
            8 .PP
            9 ..
           10 .de Sp \" Vertical space (when we can't use .PP)
           11 .if t .sp .5v
           12 .if n .sp
           13 ..
           14 .de Ip \" List item
           15 .br
           16 .ie \\n(.$>=3 .ne \\$3
           17 .el .ne 3
           18 .IP "\\$1" \\$2
           19 ..
           20 .TH "BMF" 1 "" "" ""
           21 .SH NAME
           22 bmf \- efficient Bayesian mail filter
           23 .SH "SYNOPSIS"
           24 
           25 .nf
           26 \fBbmf\fR [-b] [-t] [-n] [-s] [-N] [-S] [-d db] [-k n] [-m fmt] [-p]
           27     [-v] [-V] [-h]
           28 .fi
           29 
           30 .SH "DESCRIPTION"
           31 
           32 .PP
           33 bmf is a Bayesian mail filter. In its normal mode of operation, it takes an email message or other text on standard input, does a statistical check against lists of "good" and "spam" words, registers the new data, and returns an exit status indicating whether or not the message is spam. bmf is written with fast, zero-copy algorithms, coded directly in C, and tuned for speed. It aims to be faster, smaller, and more versatile than similar applications.
           34 
           35 .PP
           36 bmf supports both mbox and maildir mail storage formats. It will automatically process multiple messages within an mbox file separately.
           37 
           38 .SH "OPTIONS"
           39 
           40 .PP
           41 Without command-line options, bmf processes the input, registers it as either "good" or "spam", and returns the corresponding exit status. The wordlist directory and the word files are created if they do not exist.
           42 
           43 .PP
           44 \fB-b\fR Bulk test mode. Read a list of file names from stdin and, for each file, write one line containing the file name, a TAB, and its spamicity score.
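As a minimal sketch of consuming the bulk-mode output described above, the following Python splits each line at its last TAB into a file name and a score. The function name parse_bulk_output is hypothetical, and the sketch assumes the score is printed as a plain decimal number, which this page does not specify.

```python
def parse_bulk_output(text):
    """Parse bulk-mode style output: one 'file<TAB>score' record per line.

    Assumption: the score field is a plain decimal number.
    """
    results = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split at the last TAB so file names containing TABs stay intact.
        path, _, score = line.rpartition("\t")
        results.append((path, float(score)))
    return results
```

A line without a TAB yields an empty file name, and a non-numeric score raises ValueError, so malformed input is not silently accepted.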
           45 
           46 .PP
           47 \fB-t\fR Test to see if the input is spam. The word lists are not updated. A report is written to stdout showing the final score and the tokens with the highest deviation from a mean of 0.5.
           48 
           49 .PP
           50 \fB-n\fR Register the input as non-spam.
           51 
           52 .PP
           53 \fB-s\fR Register the input as spam.
           54 
           55 .PP
           56 \fB-N\fR Register the input as non-spam and undo a prior registration as spam.
           57 
           58 .PP
           59 \fB-S\fR Register the input as spam and undo a prior registration as non-spam.
           60 
           61 .PP
           62 \fB-d db\fR Specify database or directory for loading and saving word lists. The default is \fI~/.bmf\fR in text mode.
           63 
           64 .PP
           65 \fB-k n\fR Specify the number of extrema (keepers) to use in the Bayes calculation. The default is 15.
           66 
           67 .PP
           68 \fB-m fmt\fR Specify mail storage format. Valid formats are mbox and maildir. The default is to automatically detect the mail storage format. This option is deprecated.
           69 
           70 .PP
           71 \fB-p\fR Copy the input to the output (passthrough) and insert spam headers in the style of SpamAssassin. An X-Spam-Status header is always inserted with processing details. The contents of this header always begin with either "Yes" or "No". If the input is judged to be spam, the header "X-Spam-Flag: YES" is also inserted.
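A downstream tool can act on the passthrough headers without re-running the filter. The sketch below, using Python's standard email module, reads the X-Spam-Status header and relies only on the documented fact that its contents begin with "Yes" or "No". The function name spam_verdict is hypothetical, and the sketch makes no assumption about the processing details that follow that first word.

```python
from email import message_from_string


def spam_verdict(raw_message):
    """Return True for spam, False for non-spam, None if the header is absent."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status")
    if status is None:
        return None  # message did not pass through the filter
    # Only the leading "Yes"/"No" is documented; ignore the rest of the value.
    return status.strip().lower().startswith("yes")
```

Checking X-Spam-Status rather than X-Spam-Flag lets the caller tell "judged non-spam" apart from "never filtered", since the flag header is inserted only for spam.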
           72 
           73 .PP
           74 \fB-v\fR Be more verbose. This option is not well supported yet.
           75 
           76 .PP
           77 \fB-V\fR Display version information.
           78 
           79 .PP
           80 \fB-h\fR Display usage information.
           81 
           82 .SH "THEORY OF OPERATION"
           83 
           84 .PP
           85 bmf treats its input as a bag of tokens. Each token is checked against "good" and "bad" wordlists, which maintain counts of the number of times it has occurred in non-spam and spam mails. These counts are used to compute the probability that a mail in which the token occurs is spam. After probabilities for all input tokens have been computed, a fixed number of the probabilities that deviate furthest from the neutral value of 0.5 are combined using Bayes's theorem on conditional probabilities.
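The selection and combination steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not bmf's actual code: the combining formula and the clamping of probabilities to the range 0.01 to 0.99 follow Paul Graham's essay, the function names are hypothetical, and the default of 15 keepers mirrors the documented default of -k.

```python
from math import prod


def select_extrema(probs, keepers=15):
    """Return the `keepers` probabilities that lie furthest from 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:keepers]


def combine(probs):
    """Combine per-token spam probabilities into a single score.

    Assumption: naive Bayes combination as in Graham's essay, with each
    probability clamped to [0.01, 0.99] so no single token decides alone.
    """
    clamped = [min(0.99, max(0.01, p)) for p in probs]
    spam = prod(clamped)
    ham = prod(1.0 - p for p in clamped)
    return spam / (spam + ham)


def spamicity(probs, keepers=15):
    """Score a message from the spam probabilities of its tokens."""
    return combine(select_extrema(probs, keepers))
```

Tokens with a probability near 0.5 carry little evidence either way, which is why only the extrema are kept. With no tokens at all, the sketch returns the neutral score 0.5.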
           86 
           87 .PP
           88 While this method sounds crude compared to the more usual pattern-matching approach, it turns out to be extremely effective. Paul Graham's paper A Plan For Spam: \fIhttp://www.paulgraham.com/spam.html\fR is recommended reading.
           89 
           90 .PP
           91 bmf improves on Paul's proposal by doing smarter lexical analysis. In particular, hostnames and IP addresses are kept as tokens, while certain types of MTA information (such as message IDs and dates) are discarded.
           92 
           93 .PP
           94 MIME and other attachments are not decoded. Experience from watching the token streams suggests that spam with enclosures invariably gives itself away through cues in the headers and non-enclosure parts. Nonetheless, I would like to add the ability to decode quoted-printable and perhaps base64 encodings for textual attachments.
           95 
           96 .SH "INTEGRATION WITH OTHER TOOLS"
           97 
           98 .PP
           99 Please see the README for samples and suggestions.
          100 
          101 .SH "RETURN VALUES"
          102 
          103 .PP
          104 In passthrough mode: zero for success, nonzero for failure.
          105 
          106 .PP
          107 In non-passthrough mode: 0 for spam; 1 for non-spam; 2 for I/O or other errors.
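Note that in non-passthrough mode the exit status 0 means spam, the reverse of the usual "zero is success" reading. A caller may find it safer to translate the status explicitly, as in this Python sketch. The function name interpret_status and the returned labels are hypothetical, and treating every status other than those listed above as an error is an assumption.

```python
def interpret_status(code, passthrough=False):
    """Translate a bmf exit status into a label, per RETURN VALUES.

    Assumption: any status not listed in the man page counts as an error.
    """
    if passthrough:
        # Passthrough mode: zero for success, nonzero for failure.
        return "ok" if code == 0 else "error"
    # Non-passthrough mode: 0 for spam, 1 for non-spam, 2 for errors.
    return {0: "spam", 1: "non-spam"}.get(code, "error")
```

The status itself could come from running bmf with -t, which tests the input without updating the word lists, for example through the returncode attribute of Python's subprocess.run.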
          108 
          109 .SH "FILES"
          110 
          111 .TP
          112 \fI~/.bmf/goodlist.txt\fR
          113 List of good tokens for text mode.
          114 
          115 .TP
          116 \fI~/.bmf/spamlist.txt\fR
          117 List of bad tokens for text mode.
          118 
          119 .SH "BUGS"
          120 
          121 .PP
          122 The lexer should recognize multiline headers.
          123 
          124 .PP
          125 The lexer should recognize MIME attachments.
          126 
          127 .PP
          128 Content-Transfer-Encoding is not decoded.
          129 
          130 .SH "AUTHOR"
          131 
          132 .PP
          133 Tom Marshall <tommy@tig-grr.com>.
          134 
          135 .PP
          136 The Bayes algorithm is from bogofilter by Eric S. Raymond <esr@thyrsus.com>. bogofilter can be found at the bogofilter project page: \fIhttp://bogofilter.sourceforge.net/\fR.
          137