Fighting spam


This page describes some ways in which you can help fight spam or at least reduce the amount of spam you receive in your Inbox. It doesn't go into any detail on how to install or set up spamd or spamc, for example, or how to use the SpamAssassin configuration in a "global" sense (although it's possible to do).

Some of the configuration file locations may be distribution specific, and some of the configuration options may not work on your particular distribution or release.

Please add your own details to this page or make corrections.

Server level

If you bring your email in via an SMTP feed you can use DNSBLs (DNS Blacklists) to reject email during the SMTP communication. You really want to try to do as much "fighting" as possible at this level because it reduces system load. However, using DNSBLs only stops a small amount of spam.

Caution should be exercised when using DNSBLs. As was pointed out in the mailing list, they're a nightmare to track. They're constantly changing, as are the policies. It's up to you to monitor what's being rejected and what policy changes may have occurred with each DNSBL.

More problematically, allowing basic classifiers to decide an outright rejection before passing to a powerful classifier creates a pretty broken classifier on the whole, allowing, in this case, a significantly elevated rate of false positives. SpamAssassin contains a range of classifiers and combines the outputs intelligently (actually it's not that intelligent, but it could be in the not-too-distant future). This smooths over the various performance quirks of the tests by knowing how good each test is as a predictor of spam. The performance of this classifier is very good. But beware: if you pre-filter based on one of the tests then a) the intelligent classifier will perform worse because it has fewer results to go on, and b) you have given undue credit to the performance of this rather worse classifier. (Equivalent article about Biometric Classifiers) -- DanPope


I'm using the following DNSBL setup in Exim configuration (under acl_check_rcpt):

  deny    message       = rejected because $sender_host_address is in a blacklist at $dnslist_domain\n$dnslist_text
          dnslists      = : \
                 : \

  warn    message       = X-Warning: $sender_host_address is in a blacklist at $dnslist_domain\n$dnslist_text
          dnslists      = : \
                 : \
                 : \
                 : \
                 : \
                 : \
                 : \
                 : \
                 : \
                 : \
                 : \
                 : \
                 : \
                 : \
 #               : \    # offline as of 2008-06-17
                 : \
                 appears to work the best. Most of the other DNSBLs don't get used. Some of them aren't true "spammer" lists. Some of them also list virus infected / compromised machines. The one which causes the most problems (although it only happens once a month, if that) is which often lists hosts incorrectly. Also note that if you're a high volume mail server you'll be generating a DNS lookup for every DNSBL listed, for each message that comes in to your MTA.
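
To make the per-message DNS cost concrete: a DNSBL check is an ordinary DNS query for the sender's IPv4 address with its octets reversed, placed under the list's zone. The sketch below only builds that query name. The function name dnsbl_query_name and the zone dnsbl.example.org are made up for illustration and are not taken from the setup above.

```shell
# dnsbl_query_name IPV4 ZONE
# Print the DNS name that is looked up to test IPV4 against the list ZONE.
# ZONE is whichever list you use; dnsbl.example.org is only a placeholder.
dnsbl_query_name() {
    echo "$1" | awk -F. -v zone="$2" \
        'NF == 4 { print $4 "." $3 "." $2 "." $1 "." zone }'
}

# Example (not run here):
#   host "$(dnsbl_query_name 192.0.2.99 dnsbl.example.org)"
```

A listed address normally resolves to an A record in 127.0.0.0/8, and an unlisted one returns NXDOMAIN. One such query is made per list in the dnslists line, which is why a long list adds up on a busy server.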

SPF (Sender Policy Framework)

SPF filters may cut down on the amount of spam you receive. They do provide a solid foundation for a less exploitable global email system and are thoroughly advisable. See WhatIsSPF for more details.

-- DanPope


Most spam fighting is done via SpamAssassin and most people do have it installed. By default it is quite useless but with some tweaks, it becomes very effective.

Part 1: Razor

Razor is a distributed spam classifying system. Razor can be used with SpamAssassin. So when some spam is checked, Razor is also invoked. Debian users can aptitude install razor. Once razor is installed:

  use_razor2              1

  # Razor2 checks
  loadplugin Mail::SpamAssassin::Plugin::Razor2

Part 2: Pyzor

Pyzor is another implementation of Razor. The difference is it's 100% open source. With Razor, the "server side" is closed source. Debian users can aptitude install pyzor. Once pyzor is installed:

  use_pyzor              1

  # Pyzor checks
  loadplugin Mail::SpamAssassin::Plugin::Pyzor

Part 3: Distributed Checksums Clearinghouse

Distributed Checksums Clearinghouse (or DCC) is once again a distributed method of spam detection. Users send in spam emails to DCC (via a client). If lots of the same email are received then DCC "publishes" the email's checksum so everyone benefits.

Debian specific: Debian users can aptitude install dcc-client. This will attempt to start a daemon running which you don't actually need. So stop dcc-client (invoke-rc.d dcc-client stop) and then remove it from your runlevel (e.g. mv /etc/rc.runlevel/S20dcc-client /etc/rc.runlevel/xS20dcc-client).

You may also need to turn IPv6 support off in DCC otherwise DCC won't work properly and when you test SpamAssassin you will see "dbg: dcc: got response: socket(UDP): Address family not supported by protocol". To turn off IPv6:

  # cdcc "IPv6 off"
  # cdcc info | grep ^IPv6

It has also been suggested to delete the greylist entry for

  # cdcc "delete greylist"
  # cdcc info

There's also /etc/dcc/map.txt (/var/lib/dcc/map.txt is symlinked here as well). I'm not sure at the moment how to use map.txt as I can't seem to get it to work properly. Any help would be appreciated! -- DavidRamsden

  use_dcc              1

  # DCC checks
  loadplugin Mail::SpamAssassin::Plugin::DCC

Part 4: Languages and locales

I'm only expecting email in the English language (I'd love to know more languages...). And I'm only expecting the locale to be English. In my ~/.spamassassin/user_prefs I have set:

  ok_languages            en
  ok_locales              en

The above options will be ignored on Debian unstable for example. I'm not sure what the new way of doing this is.

Part 5: Bayesian analysis

Bayesian analysis is the bit you really really want to get working. This is the autolearning stuff that SpamAssassin can do. Bayes' theorem is a result about conditional probability, and it has been applied to spam detection using lexical analysis of messages. SpamAssassin has Bayesian analysis built in, so there's no need to install anything additional. You just need to set/add the following to your ~/.spamassassin/user_prefs:

  use_bayes               1
  bayes_auto_learn        1

There will be more on Bayesian later.

Part 6: A complete user_prefs file

Here is what my (DavidRamsden) user_prefs file looks like (minus the comments):

  required_score           5.0


  blacklist_from          *

  rewrite_subject         1
  rewrite_header subject  *** SPAM ***
  use_terse_report        0
  report_safe             1

  use_bayes               1
  bayes_auto_learn        1

  skip_rbl_checks         0
  use_pyzor               1
  use_razor2              1
  use_dcc                 1

  ok_languages            en
  ok_locales              en

I've set my threshold (required_score) to 5.0. This is the default. Before I was using Bayesian analysis or Razor, DCC et al I had this set at 2.0. I've told SpamAssassin to always whitelist a few addresses. Hopefully these won't ever send me spam and they're important addresses. There's also an example of how to blacklist an address. Wildcards can be used, as can be seen.

I've told SpamAssassin to rewrite the subject line of any spam messages by prefixing "SPAM ***". This makes it easy to identify and I can also apply a mail filtering rule (although it's better to look at the X-Headers). I've set report_safe too which means the actual spam email becomes an attachment of the report. This way I don't ever have to look at it!

All the other options have been covered in the other parts.

Testing SpamAssassin

Once you've made any changes to SpamAssassin, you should always test it. Otherwise you may end up with no email (worst case). To test:

  $ spamassassin --lint -D < /usr/share/doc/spamassassin/examples/sample-spam.txt &> spamtest.txt

The path to sample-spam.txt is once again specific to Debian.

Now examine the output which will be in spamtest.txt. Look for any errors or misconfigurations. You may have to run the above test twice, especially if razor, pyzor etc. haven't been run before. If you do spot any problems try Google and if that doesn't help, ask on the MailingList.
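
The debug output is long, so it can help to filter it. The sketch below assumes that debug lines look like the "dbg: dcc: ..." line quoted earlier on this page, that the Razor and Pyzor plugins log under razor2 and pyzor in the same way, and that problems are logged at the warn or error level. The function names are made up for this example.

```shell
# sa_plugin_lines FILE
# Show debug lines from the network-check plugins (assumed facility
# names: razor2, pyzor, dcc).
sa_plugin_lines() {
    grep -E 'dbg: (razor2|pyzor|dcc): ' "$1"
}

# sa_problem_lines FILE
# Show anything logged at warn or error level (assumed level names).
sa_problem_lines() {
    grep -E '(warn|error): ' "$1"
}

# Example (not run here):
#   sa_problem_lines spamtest.txt
#   sa_plugin_lines spamtest.txt | less
```

An empty result from sa_plugin_lines may simply mean the plugin was never loaded, which is worth checking in its own right.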

Bayesian analysis: sa-learn

As I've mentioned previously, when I enabled Bayesian analysis and trained it correctly, it really reduced the amount of spam coming in to my Inbox. That's one important thing to remember: it takes time to train Bayes. You need at least 200 spam messages and at least 200 ham messages before Bayes will even start trying to classify email.

The tool to use is: sa-learn (man sa-learn).

Learning from ham

Ham is "good" email. You should run sa-learn on a mailbox that contains only ham email. Introducing spam accidentally will cause false-positives and false-negatives. If you're using mbox style mailboxes you can use:

  $ sa-learn --ham ~/Mail/HantsLUG

(assuming your mbox mailboxes are stored in $HOME under a directory called Mail)

If you're using Maildir style mailboxes you can use:

  $ sa-learn --ham ~/Maildir/.HantsLUG/cur/

Learning from spam

Spam is "bad" email. You should run sa-learn on a mailbox that contains only spam email. Introducing ham accidentally will cause false-positives and false-negatives. If you're using mbox style mailboxes you can use:

  $ sa-learn --spam ~/Mail/Spam

(assuming your mbox mailboxes are stored in $HOME under a directory called Mail)

If you're using Maildir style mailboxes you can use:

  $ sa-learn --spam ~/Maildir/.Spam/cur/

Displaying the contents of the Bayes database

To see how much data Bayes has collected, you can run:

  $ sa-learn --dump magic
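
To check whether Bayes has reached the 200-message minimum mentioned earlier, the counts can be pulled out of that output. This is a sketch which assumes the dump contains lines ending in "non-token data: nspam" and "non-token data: nham" with the count in the third column. bayes_counts and bayes_ready are hypothetical helper names.

```shell
# bayes_counts
# Read `sa-learn --dump magic` output on stdin and print "NSPAM NHAM".
# Assumes the count is the third whitespace-separated column.
bayes_counts() {
    awk '/non-token data: nspam$/ { spam = $3 }
         /non-token data: nham$/  { ham = $3 }
         END { print spam + 0, ham + 0 }'
}

# bayes_ready NSPAM NHAM
# Succeed once both counts have reached 200.
bayes_ready() {
    [ "$1" -ge 200 ] && [ "$2" -ge 200 ]
}

# Example (not run here):
#   set -- $(sa-learn --dump magic | bayes_counts)
#   bayes_ready "$1" "$2" && echo "Bayes has enough mail to start scoring"
```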

What happens if spam drops in to my Inbox or vice versa?

I've created a mail folder called "This is spam". If some spam drops in to my Inbox, I drag it to my "This is spam" folder and at the end of each day, I run:

  $ sa-learn --spam ~/Maildir/.This\ is\ spam/cur/

You'll need to change the above path so it's suitable for your setup.

I've done the same for any ham which ends up being classed as spam. I have a mail folder called "This is ham" and run the above command again but with --ham against this folder.

Automating the learning process

You can write a shell script to do all of the above for you and either run it as a cron job or run it manually. My (DavidRamsden) script looks like:


  echo "*** Learning from ham... ***"

  sa-learn --ham --showdots /home/david/Maildir/cur/
  sa-learn --ham --showdots /home/david/Maildir/.HantsLUG/cur/
  sa-learn --ham --showdots /home/david/Maildir/.HostAP/cur/
  sa-learn --ham --showdots /home/david/Maildir/.DebianSecurity/cur/
  sa-learn --ham --showdots /home/david/Maildir/.aMSN-devel/cur/
  sa-learn --ham --showdots /home/david/Maildir/.UniversityProject/cur/
  sa-learn --ham --showdots /home/david/Maildir/.Spam.This\ is\ ham/cur/

  echo "*** Learning from spam... ***"

  sa-learn --spam --showdots /home/david/Maildir/.Spam/cur/
  sa-learn --spam --showdots /home/david/Maildir/.Spam.This\ is\ spam/cur/


Another useful tool is procmail. Why is it useful?

First of all, here is a snippet from my (DavidRamsden) ~/.procmailrc:

  * ^X-Spam-Status: Yes

Here I am saying if "X-Spam-Status: Yes" appears in the headers, SpamAssassin has classified the email as spam so automatically move it to my Spam folder. I use Maildir format mailboxes and IMAP.

Now that Bayes has been running for a while, I'm starting to see spam with a score of >25.0. This is always spam so it would be pretty safe for me to write a procmail rule that sends any email with a score greater than 25.0 straight to /dev/null. That way, I never have the trouble of seeing it appear as a "new message".
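
Before writing such a rule, it may be worth checking from the shell which messages it would have caught. A rough sketch, assuming SpamAssassin writes the score as score=N on the first line of the X-Spam-Status header (older releases wrote hits=N instead). spam_score and score_over are names made up for this example.

```shell
# spam_score
# Read one message on stdin and print the score found in its
# X-Spam-Status header. Stops at the first blank line (end of headers).
# Assumes the "score=N" form of the header.
spam_score() {
    sed -n -e '/^$/q' \
           -e 's/^X-Spam-Status:.*[ ,]score=\(-\{0,1\}[0-9.]*\).*/\1/p'
}

# score_over SCORE LIMIT
# Succeed if SCORE is greater than LIMIT (awk does the decimal comparison).
score_over() {
    awk -v s="$1" -v t="$2" 'BEGIN { exit !(s + 0 > t + 0) }'
}

# Example: list messages in the Spam folder that scored above 25.0
#   for m in "$HOME"/Maildir/.Spam/cur/*; do
#       s=$(spam_score < "$m")
#       [ -n "$s" ] && score_over "$s" 25.0 && echo "$m"
#   done
```

If everything this lists really is spam over a few weeks, discarding at that threshold is probably safe for your mail.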


I've implemented all of the above and it works for me (tm). I sometimes get one or two spam emails in my Inbox overnight. This doesn't bother me when there's say 20+ emails correctly classified in my Spam folder. Using and constantly training Bayes is the key. I've also never (so far) had an email incorrectly classified as spam. -- DavidRamsden

Page written by DavidRamsden

LinuxHints/FightingSpam (last edited 2010-05-28 17:14:54 by AdamTrickett)