Sunday 27 May 2007

How to Configure SpamAssassin Bayesian Filter to Work with Exchange

Some organisations run a Unix email gateway server and use SpamAssassin to filter out spam before Internet email is delivered to the Exchange system. To get better filtering results, you need to use the Bayesian filter and feed it both spam and ham (legitimate email). However, manually pulling spam and ham out of Exchange mailboxes and importing them to train the Bayesian filter is a fairly time-consuming process.

You can create two folders in the Public Folder tree (one for spam, one for legitimate email) and ask people to move spam and ham into them. A Perl script then pulls all the messages out of both folders and trains the Bayesian filter each time it runs. This way, everybody in your organisation can contribute the spam they receive by putting it in the spam folder.


Figure 1

Next, copy the Perl script (learn-spam.pl) to the SpamAssassin server and modify it to suit your environment. (I found this script in a forum on the Internet.)


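Please change the following accordingly: the IMAP server name, the user name and password (passed on the command line as -uid and -pwd, or set as the defaults in the script), and the public folder names in the two learn_mail calls.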

$imapserver = "YOUR IMAP SERVER";

-uid="USERNAME"
-pwd="PASSWORD"

learn_mail ($HOME."/spam/", ".spam", "This is SPAM/", 1, "--spam --showdots");
learn_mail ($HOME."/ham/", ".ham", "Legitimate Email/", 1, "--ham --showdots");
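
To make the -uid and -pwd settings concrete, here is a minimal sketch of how the option parsing behaves. It is not part of the original script: parse_credentials is a hypothetical helper I made up for illustration, and the account values are placeholders. It uses GetOptionsFromArray from Getopt::Long so the parsing can be tried without touching @ARGV.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper (not in the original script): parse the same two
# options from an explicit argument list instead of @ARGV.
sub parse_credentials {
    my @args = @_;

    # Built-in defaults, as in the %options hash of the script
    my %options = (uid => "username", pwd => "password");

    GetOptionsFromArray(\@args,
        "uid=s" => \$options{uid},
        "pwd=s" => \$options{pwd},
    ) or die "Could not parse options\n";

    return %options;
}

# Placeholder values; no real account is implied
my %opts = parse_credentials("-uid=spamtrainer", "-pwd=secret");
print "uid=$opts{uid}\n";

# With no arguments the built-in defaults are kept
my %defaults = parse_credentials();
print "uid=$defaults{uid}\n";
```

On the command line this corresponds to running the script as: perl learn-spam.pl -uid=USERNAME -pwd=PASSWORD
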
----------------------------------------------------------
#!/usr/bin/perl -w
use strict;
use Mail::IMAPClient;
use Shell;
use Env qw(HOME);
use Getopt::Long;

use File::Temp qw/ tempfile tempdir /;

my $imapserver = "exchangeserver.allaboutexchange.net";

# set to 1 to enable imapclient debugging
my $debug = 0;

# set to 1 if running under cron (disables output)
my $cron = 0;

my $filename;
my $fh;

my %options =
(
uid => "username",
pwd => "password"
);

my $cmdsts = GetOptions ("uid=s" => \$options{uid},
"pwd=s" => \$options{pwd});

if (!$options{uid}) { die "[SPAMASSASSIN] uid not set (-uid=username)\n"; }
if (!$options{pwd}) { die "[SPAMASSASSIN] pwd not set (-pwd=password)\n"; }

my $uid = $options{uid};
my $pwd = $options{pwd};

# login to imap server
my $imap = Mail::IMAPClient->new (Server=>$imapserver, User=>$uid,
Password=>$pwd, Debug=>$debug)
or die "Can't connect to $uid\@$imapserver: $@\n";

if ($imap)
{
my $count;

# Deal with spam first
learn_mail ($HOME."/spam/", ".spam", "This is SPAM/", 1, "--spam --showdots");

# Now deal with ham
learn_mail ($HOME."/ham/", ".ham", "Legitimate Email/", 1, "--ham --showdots");

}
else
{
die "[SPAMASSASSIN] Unable to logon to IMAP mail account! $options{uid}\n";
}

exit;

#
# read and learn mail from imap server
#
# arguments
# $dir directory to place retrieved messages in
# $ext file extension to use on retrieved messages
# $folder imap folder name on server
# $shared 0 if imap folder is in users mailbox
# 1 if imap folder is in shared name space or
# $sa_args additional arguments to specify to sa-learn
# (e.g. --spam or --ham)
#
sub learn_mail {
my $dir = shift (@_);
my $ext = shift (@_);
my $folder = shift (@_);
my $shared = shift (@_);
my $sa_args = shift (@_);

my $count = 0;

# tidy up directory before run
clear_directory ($dir, $ext);

# read mail from server
$count = read_mail ($dir, $ext, $folder, $shared);
if ($count > 0)
{
# learn about mail
sa_learn ($dir, $ext, $sa_args);

# tidy up files after sa-learn is called
clear_directory ($dir, $ext);
}
}


#
# reads mail from an imap folder and saves in a local directory
#
# arguments
# $dir directory to place retrieved messages in
# $ext file extension to use on retrieved messages
# $folder imap folder name on server
# $shared 0 if imap folder is in users mailbox
# 1 if imap folder is in shared name space or
sub read_mail {
my $dir = shift (@_);
my $ext = shift (@_);
my $folder = shift (@_);
my $shared = shift (@_);
my $count = 0;
my $target = "";

if ($shared)
{
# use a shared public folder instead
my ($prefix, $sep) = @{$imap->namespace->[2][0]}
or die "Can't get shared folder namespace or separator: $@\n";

$target = $prefix.
($prefix =~ /\Q$sep\E$/ || $folder =~ /^\Q$sep/ ? "" : $sep).
$folder;
}
else { $target = $folder; }

$imap->select ($target) or die "Cannot select $target: $@\n";

# Shared public folders are handled above through the $shared argument

# read through all messages
my @msgs = $imap->search("ALL");
foreach my $msg (@msgs)
{
($fh, $filename) = tempfile (SUFFIX => $ext, DIR => $dir);
$imap->message_to_file ($fh, $msg);
close $fh;
$count++;
}

if ($cron == 0) { print "Retrieved $count messages from $target\n"; }

return $count;
}

#
# Removes files in directory $dir with extension $ext
#
sub clear_directory{
my $dir = shift (@_);
my $ext = shift (@_);

opendir (DIR, $dir) or die "Couldn't open dir: $dir\n";
my @files = readdir (DIR);
closedir (DIR);

# \Q...\E quotes the extension so the leading dot is matched literally
for (my $i = 0; $i <= $#files; $i++ )
{
if ($files[$i] =~ /\Q$ext\E$/) { unlink ($dir.$files[$i]); }
}
}

#
# execute sa-learn command
#
sub sa_learn {
my $dir = shift (@_);
my $ext = shift (@_);
my $type = shift (@_);

my $learncmd = "/usr/bin/sa-learn ".$type." --dir ".$dir;
if ($cron == 0) { $learncmd .= " --showdots"; }
else { $learncmd .= " > /dev/null 2>&1"; }

#
# Run sa-learn script on spam directory
#
my $sh = Shell->new;
my @args = ($learncmd);

system (@args) == 0 or die "system @args failed: $?";
}

----------------------------------------------------------
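
The least obvious part of the script is how read_mail builds the public folder name from the IMAP namespace. Here is a minimal sketch of that step on its own, so you can see what folder name the script will try to select. The helper name shared_folder_path is my own invention, and the "Public Folders" prefix and "/" separator are assumed values for illustration; the real ones come from $imap->namespace on your server.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper mirroring the join logic in read_mail():
# add the separator only if neither side already supplies it.
sub shared_folder_path {
    my ($prefix, $sep, $folder) = @_;

    my $needs_sep = !($prefix =~ /\Q$sep\E$/ || $folder =~ /^\Q$sep\E/);
    return $prefix . ($needs_sep ? $sep : "") . $folder;
}

# Assumed namespace values; check what your own server reports
print shared_folder_path("Public Folders/", "/", "This is SPAM/"), "\n";
print shared_folder_path("Public Folders",  "/", "This is SPAM/"), "\n";
print shared_folder_path("Public Folders",  "/", "/Legitimate Email/"), "\n";
```

If the script dies with "Cannot select ...", comparing the folder name in that message with the names of your public folders is a reasonable first check.
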
I am very happy with the spam filtering results after implementing this.