The Winding Path of Mail
For the past couple of months I have been working to learn more about how email is processed and how I could filter and sort it more intelligently before it reaches me. One of the side effects of using a laptop as your main computer is that while you can use all sorts of client-side mail filtering applications, you really don’t want to. Where possible, I use my server to handle the mail processing. If a message reaches my Mail.app inbox, it is almost certainly something I want to see.
The first part of the ride is Postfix with TLS for accepting incoming messages, with the usual set of restrictions on what sort of host I’ll accept a message from, such as refusing dynamic IPs. This keeps a good number of the spammers at bay, as smtpd will simply refuse connections much of the time. If you have met all the initial outside requirements to not look like a spammer, and my server accepts the connection, you must provide a valid recipient username or the server will not accept the message. In my configuration, Postfix will not accept a message for a user that it knows does not exist, retry delivery, and finally send a bounce reply. None of that; it just won’t try at all in the first place. If you are sending to a valid user, smtpd will hand the mail over to the local daemon for A Mail And Virus Scanner: amavis.
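To make the reject-up-front idea concrete, here is a minimal Python sketch of the decision, not Postfix's actual code or configuration. The function name, the user set, and the example addresses are all made up for illustration; the reply strings only mirror the general shape of SMTP replies for an unknown and a known recipient.

```python
def smtp_reply_for(recipient: str, local_users: set) -> str:
    """Return the reply a server might give to RCPT TO (illustrative only)."""
    # Only the part before the @ is checked in this simplified sketch.
    user = recipient.partition("@")[0].lower()
    if user not in local_users:
        # Refused during the SMTP conversation: nothing is queued,
        # so there is nothing to retry and no bounce to send later.
        return "550 5.1.1 Recipient address rejected: User unknown"
    return "250 2.1.5 Ok"


# Hypothetical users and addresses.
users = {"alice", "postmaster"}
print(smtp_reply_for("alice@example.org", users))
print(smtp_reply_for("nobody@example.org", users))
```

The point of refusing at this stage is that the sending server, not mine, is left holding the undeliverable message.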
Amavis isn’t actually a mail or virus scanner of any kind on its own. It is a frontend for other popular scanners. It is far easier to tweak /etc/amavis.conf to set all your scanning options for whatever you wish to pass mail through than to configure the individual programs themselves. Amavis does all the hard work for you, and you need only add a few lines to your Postfix master.cf to pass the mail along to amavis. The one thing amavis does do itself is check the message for bad headers. If a message is found to have bad headers, the message is compressed and sent to the quarantine, /var/virusmails, with a new name derived from the Message-ID, prefixed by badh-.
In my case, I am telling amavis to first send the message through the Clam Antivirus daemon, clamd. ClamAV uses built-in and external tools along with the most frequently updated virus database I know of to check the message out. I don’t actually have a good reason for doing this. I don’t own any computers running Windows and I’m not the least concerned about mail security for Linux or OS X. For the most part, I run ClamAV as a learning exercise, and because I would like to differentiate the virus mail from spam mail and know the source of it. Any message found to contain a virus attachment in any form is rewritten so the attachment is harmless, and then renamed based on the Message-ID, prefixed with virus- and stored in /var/virusmails.
After ClamAV okays the message, it goes to SpamAssassin’s spamd. Among the several popular spam filtering engines out there, I like SpamAssassin for its multiple methods of scanning, ease of adding new rules, and wide availability of support and documentation. SpamAssassin can use hard-and-fast filter rules, RBLs, and Bayesian scoring to determine a message’s spamminess. It comes with a handful of good default rules, such as looking in the body for obfuscated URLs, spoofed From: headers, or mentions of common known unsolicited products. The important thing about Bayesian scoring is that, unlike a filter rule which says ‘This message is/is not spam,’ it uses a scoring system that is itself dynamic: over time the Bayes score points change depending on what SpamAssassin has learned is and is not spam to you. The score can even dip into the negatives if a message appears very legitimate. There are 6 categories of what SpamAssassin can do once it scores a message. I have lowered my scoring thresholds from the defaults. An explanation of the scores follows.
Score       Action
Below 0.0   CLEAN: Deliver, notify spamd to learn from this message
0.0         CLEAN: Deliver, notify spamd, add score headers explaining why it's clean
1.5         SPAMMY: Deliver, notify spamd, add score headers, and "***SPAM***" to the Subject: header
2.75        SPAM: Quarantine, notify spamd, add score headers
8           SPAM: Quarantine, add score headers, but do not notify spamd
25          SPAM: Delete
For a message to be rewritten at all, it has to have a non-negative score. Negative scores are achieved by having very positive factors, such as being sent from the local machine. If it breaks on an even 0 or anything higher up to 24.999, it is rewritten. The headers look like:
X-Spam-Flag: (either YES or NO)
X-Spam-Score: (expressed to the thousandths place)
X-Spam-Level: (the spam level expressed as a line of asterisks)
X-Spam-Status: (Yes or No, followed by the score and all the tests applied which resulted in the score)
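As a rough illustration of how those four headers relate to a single score, here is a small Python sketch. It is not how amavis or SpamAssassin builds them internally, and the exact X-Spam-Status layout is simplified. The function name is made up, the test names in the usage line are only example values, and treating 1.5 (my "spammy" level) as the point where the flag flips to YES is an assumption.

```python
def spam_headers(score: float, tests: list, flag_at: float = 1.5) -> dict:
    """Build illustrative X-Spam-* headers from a score and the tests that fired."""
    is_spam = score >= flag_at          # assumption: flag flips at the "spammy" level
    verdict = "Yes" if is_spam else "No"
    test_list = ",".join(tests)
    return {
        "X-Spam-Flag": "YES" if is_spam else "NO",
        # Score expressed to the thousandths place.
        "X-Spam-Score": f"{score:.3f}",
        # One asterisk per whole point; nothing for scores below 1.
        "X-Spam-Level": "*" * max(0, int(score)),
        # Simplified layout: verdict, score, then the tests behind the score.
        "X-Spam-Status": f"{verdict}, score={score:.3f} tests={test_list}",
    }


for name, value in spam_headers(3.2, ["BAYES_99", "HTML_MESSAGE"]).items():
    print(f"{name}: {value}")
```

The asterisk line is handy because a client-side rule can match on a run of asterisks without having to parse a number.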
After applying headers, amavis will tell spamd to learn from this message, and the Bayesian scoring database will be updated to reflect this new information. All the interesting stuff with SpamAssassin’s Bayesian engine happens between 0 and your quarantine level.
At 1.5, the message is deemed “spammy”: looks like spam, but not entirely sure, so prefix the Subject line with “***SPAM***” and deliver anyway. Most mail clients will understand this and throw it in your Junk folder. If it gets up to 2.75, the message is considered spam, and after the X-Spam-* headers are added, spamd is notified, and the message is renamed to a pseudo-random identifier based on the Message-ID header, gzip compressed, and moved to /var/virusmails.
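The quarantine step (rename based on the Message-ID, gzip, drop into a directory) can be sketched in a few lines of Python. This is only an illustration: I don't know the exact naming scheme amavis uses, so the SHA-1 digest, the 12-character truncation, and the function name are all assumptions. The prefix is passed in, since the ones mentioned earlier are badh- and virus-.

```python
import gzip
import hashlib
from pathlib import Path


def quarantine(raw_message: bytes, message_id: str, prefix: str,
               quarantine_dir: Path) -> Path:
    """Store a gzip-compressed copy of the message under a derived name."""
    # Assumption: any stable digest of the Message-ID will do as an identifier.
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()[:12]
    target = quarantine_dir / f"{prefix}-{digest}.gz"
    with gzip.open(target, "wb") as fh:
        fh.write(raw_message)
    return target
```

For example, `quarantine(raw, "<abc@example.org>", "virus", Path("/var/virusmails"))` would write a file named virus-<digest>.gz into the quarantine directory, and `gzip.open(path).read()` gets the original message back for inspection.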
At a certain point, it is no longer beneficial to teach spamd about new spam because the score was so high that it wasn’t near the ham/spam tipping point. The default is 10, but I lowered mine to 8. The same thing happens at 8 that happens at 2.75, except that once the score reaches 8, the Bayesian database isn’t updated.
At an even higher point, the message is almost certainly spam so it isn’t even smart to keep the message at all. The default score for this action, 25, will simply cause amavis to delete the message outright. This may sound dangerous, but a message has to be ridiculously spammy to get this high. Even though that’s true, I disabled the delete level for my server out of curiosity, and saw a handful of messages up in the 40+ range and one up in the 50s. Yes, they are ridiculously spammy.
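Putting the levels together, the whole decision is just a walk down a list of thresholds. Here is a minimal Python sketch using my numbers from above; the function name and the wording of the action strings are made up for illustration, and none of this is amavis code.

```python
# (threshold, action) pairs, highest first; a score matches the first
# threshold it meets or exceeds.
LEVELS = [
    (25.0, "SPAM: delete"),
    (8.0, "SPAM: quarantine, add headers, do not notify spamd"),
    (2.75, "SPAM: quarantine, add headers, notify spamd"),
    (1.5, "SPAMMY: deliver, add headers and ***SPAM*** subject, notify spamd"),
    (0.0, "CLEAN: deliver, add headers, notify spamd"),
]


def action_for(score: float) -> str:
    """Map a spam score to the action taken at that level."""
    for threshold, action in LEVELS:
        if score >= threshold:
            return action
    # Negative scores: delivered without being rewritten.
    return "CLEAN: deliver, notify spamd"


for s in (-1.2, 0.0, 1.9, 2.75, 12.0, 41.0):
    print(f"{s:>6}: {action_for(s)}")
```

Checking from the highest threshold down means each score lands in exactly one category, and changing a level is a one-line edit.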
Once a message is approved by SpamAssassin, it is sent back into Postfix’s local delivery agent, which then hands each message to procmail.
Procmail is a mail processing program that can move, rename, and rewrite a message any way you want based on almost any criteria. I am on a couple of mailing lists where I’m interested in ~80% of the messages and ~20% I’ll just delete on sight. I use procmail to save myself that step: it files the messages from each mailing list into their own folders right on the server, before the client ever sees them, and the ~20% I won’t care about are simply deleted. Procmail can do a lot more, but my main use is to sort mail into folders on the server side.
Procmail drops the message into the appropriate place under ~/Mail, an IMAP folder tree. After that, the mail reception process is over, and the message will sit until dealt with by the client.
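Procmail has its own recipe syntax, but the sorting logic itself is simple enough to sketch in Python. This is only an illustration of the idea, not my actual recipes: the list addresses, folder names, and the discard rule are all hypothetical. Delivering to /dev/null is the usual procmail way of throwing a message away.

```python
from email import message_from_string

# Hypothetical mapping from a List-Id fragment to a folder under ~/Mail.
LIST_FOLDERS = {
    "dev.lists.example.org": "Lists/dev",
    "users.lists.example.org": "Lists/users",
}
# Hypothetical subject tags for list traffic to delete on sight.
DISCARD_SUBJECT_TAGS = ("[JOBS]",)


def destination_for(raw_message: str) -> str:
    """Pick a destination folder the way a sorting recipe might."""
    msg = message_from_string(raw_message)
    list_id = msg.get("List-Id", "")
    subject = msg.get("Subject", "")
    for fragment, folder in LIST_FOLDERS.items():
        if fragment in list_id:
            if any(tag in subject for tag in DISCARD_SUBJECT_TAGS):
                return "/dev/null"   # list mail I would delete anyway
            return folder
    return "INBOX"                   # everything else goes to the inbox


sample = "List-Id: Dev list <dev.lists.example.org>\nSubject: patch review\n\nbody\n"
print(destination_for(sample))
```

Matching on the List-Id header rather than From: or Subject: is a design choice: it stays the same no matter who posts to the list.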
For serving mail, I use Courier-IMAP with GnuTLS encryption and a self-signed SSL certificate. I can connect to my IMAP store using a local program such as Pine or Mutt, Mail.app, the iPhone, or any Windows-based mail client should the need arise.
So after all the good mail makes it through to my inbox, and all the list mail gets to its folders, what about those on-the-fence messages sitting in the IMAP “Junk” folder with the “***SPAM***” subject? Those are the ones that SpamAssassin couldn’t quite determine, and this is the apex of the spam/ham seesaw. The correct thing to do is manually go through this folder, sort out what is and is not spam, and then teach spamd appropriately so that over time fewer messages end up here.
The workflow begins by me just occasionally scrolling through Junk and seeing if there’s anything in there that shouldn’t be. Not every message in Junk is going to have ***SPAM*** in the subject; only the ones that SpamAssassin rewrote that way. What if a message was flagged CLEAN, but was actually spam, and my mail client detected it and moved it here? Wouldn’t it be good for spamd to learn from those too? So the first step is to recognize that the messages in Junk come from a variety of sources. Most mail clients will even let me put the Junk from other accounts in the single, centralized Junk folder on my server.
To deal with these messages, I wrote a script that imports everything in Junk into the Bayesian database, rewrites the messages with “
But what about those few false positives? SpamAssassin could surely benefit from being force-taught that these are legitimate mail, or ‘ham’, rather than having me simply delete them, so that it stops seeing mail like them as spam. The only way I know of to do this is to manually move those messages to a safe folder, such as Sent, or some other dedicated folder, and add to my spamlearn script a directive for learning these as ham. But I don’t want to put them in Sent; that doesn’t make sense. And I don’t want them in Trash, because sometimes spam messages do end up there too. I would need to make a dedicated “Ham” or “Okay” or “False Positives” IMAP folder and import that. I could do this, but moving the messages there instead of deleting them from Junk is another time-consuming step, and the hit rate is so low on these messages that at this time it doesn’t warrant it.
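For what it's worth, the learning side of this would be a small change. The sketch below is not my spamlearn script; it is a hedged Python illustration that only builds the sa-learn command lines, using the tool's --spam and --ham options. The folder names, the idea that each folder is a plain directory under the mail root, and the function name are assumptions; the real on-disk layout depends on how the IMAP server stores folders.

```python
from pathlib import Path


def learn_commands(mail_root: Path, spam_folder: str = "Junk",
                   ham_folder: str = "") -> list:
    """Build sa-learn invocations for a spam folder and an optional ham folder."""
    commands = [["sa-learn", "--spam", str(mail_root / spam_folder)]]
    if ham_folder:
        # Only added once a dedicated false-positives folder exists.
        commands.append(["sa-learn", "--ham", str(mail_root / ham_folder)])
    return commands


# Hypothetical paths; print the commands instead of running them.
for cmd in learn_commands(Path.home() / "Mail", ham_folder="False Positives"):
    print(" ".join(cmd))
```

Each command list could then be handed to `subprocess.run(cmd, check=True)`. Keeping the ham folder optional matches where I am now: the spam half runs today, and the ham half only switches on if I ever bother to create that folder.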
