For anyone who gets lots of spam mail, I typically recommend a multi-stage anti-spam setup. A common open source solution (and the one that I use personally) is a server-based SpamAssassin (SA) front end, followed by a client-based Bayes filter, in this case the Thunderbird (TB) default filter. Both filters are tuned to never give false positives, with uncertain emails shown in an Uncertain folder that I review regularly.
In the Thunderbird 3.0 / SeaMonkey 2.0 series, I snuck in a little hidden preference that allows modifications to the way the TB Bayes filter does its tokenization. The main point of this was to allow better transfer of information between SA and TB. I’d like to describe here how to use that feature.
SA’s decision process involves two key steps. First, it evaluates the message against zillions of rules, and tags the message with each rule that the message hits. Second, it combines all of those tags into a final junk score for the message, which is used to decide whether to mark the overall message as spam.
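That two-step process can be sketched roughly as follows. This is only an illustration: the rule names come from my sample header below, but the per-rule scores and the scoring function are made up for the sketch, not SA’s real values or implementation.

```python
# Rough sketch of SA's two-step scoring. The scores below are
# illustrative placeholders, NOT SpamAssassin's actual rule scores.

# Step 1: each rule that the message hits becomes a tag with a score.
RULE_SCORES = {
    "HTML_IMAGE_ONLY_16": 1.6,   # hypothetical score
    "MIME_HTML_ONLY": 1.1,       # hypothetical score
    "SPF_PASS": -0.1,            # a passing SPF check lowers the score a bit
}

def score_message(hits, required=5.0):
    """Step 2: combine the per-rule scores and compare to the cutoff."""
    total = sum(RULE_SCORES[rule] for rule in hits)
    return total, total >= required

total, is_spam = score_message(
    ["HTML_IMAGE_ONLY_16", "MIME_HTML_ONLY", "SPF_PASS"])
print(round(total, 1), is_spam)  # 2.6 False -- tagged, but below the cutoff
```

The point to take away is that the per-rule tags survive even when the combined score stays under the `required` threshold, which is exactly the information TB can reuse.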
SA’s rules include many tests that TB does not perform (score one point for SA). Yet when SA combines all of its rules, in most cases it is using broad measures of goodness and spaminess calibrated against some concept of a universal email. In contrast, TB’s Bayes algorithm precisely tunes the junk analysis to the particular set of emails that a user receives. So if you are in the Viagra business, you can tune your local TB filter to accept those messages (score one point for TB).
Wouldn’t it be great if you could combine together the superior rule set of SA, with the superior decision making customization of TB? That is the point of the hidden preference.
SA communicates its message tags to TB in the form of a custom header, X-Spam-Status. A sample value from a spam message that I received is:
X-Spam-Status: No, score=4.9 required=5.0 tests=HTML_IMAGE_ONLY_16,
MIME_HTML_ONLY, SPF_PASS shortcircuit=no autolearn=disabled version=3.2.5
This message was not marked as spam by SA, as it just barely missed the score=5.0 cutoff point, so it arrives in TB as a suspicious but nevertheless good message. Yet that header carries lots of information about the tests that SA ran, and TB should be able to take advantage of it.
In the default TB configuration, the entire content of the X-Spam-Status header is treated as a single token, so TB will only respond to exact matches of the whole header. Yet the tests can appear in a wide variety of combinations, so a single token does not really exploit that information effectively.
The answer provided in TB3 / SM2 is a hidden preference that tells TB to break a particular header up into lots of separate tokens, and analyze each of those separately. That is documented in the code here. In the preference, you specify a particular header, and give a list of delimiters for the tokenization. So to process the x-spam-status header as individual tokens, with both space and comma (among others) as delimiters, you add the preference:
and assign it the string value “ ,\t\n\r\f”
(That can be hard to set with all of those special characters in it.) That makes space, comma, tab, line feed, carriage return, and form feed the delimiters.
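To show the effect of the preference, here is a short sketch in plain Python (not TB’s actual tokenizer): by default the whole header value is one giant token, while splitting on the delimiter string turns each SA test into its own token.

```python
import re

# The value of X-Spam-Status from my sample message above.
HEADER = ("No, score=4.9 required=5.0 tests=HTML_IMAGE_ONLY_16,"
          "MIME_HTML_ONLY, SPF_PASS shortcircuit=no autolearn=disabled "
          "version=3.2.5")

# The delimiter string from the preference: space, comma, tab,
# line feed, carriage return, form feed.
DELIMITERS = " ,\t\n\r\f"

def tokenize(value, delims):
    """Split on any single delimiter character, dropping empty pieces."""
    return [t for t in re.split("[" + re.escape(delims) + "]", value) if t]

# Default behavior: one token that only matches an identical header.
default_tokens = ["x-spam-status:" + HEADER]

# With the preference set: each SA test name becomes a separate token
# that the Bayes filter can weigh on its own.
split_tokens = ["x-spam-status:" + t for t in tokenize(HEADER, DELIMITERS)]
print(split_tokens)
```

With the split, tokens like `x-spam-status:MIME_HTML_ONLY` recur across many messages in whatever combination SA emits them, so the Bayes filter can learn a weight for each test individually.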
You really can’t apply this to an existing training corpus without getting strange results. So you must reset the training data and retrain after you change the preference.
As a sample, the tokens used by TB to process an email with my example x-spam-status header are:
This image uses the “Junk Analysis Detail” report from my JunQuilla extension.
Things to notice:
- The tokens that TB is using to analyze this email are mostly these x-spam-status tokens, since this is a mostly-image spam message with little text of its own. So the use of the SA tokens greatly enhances the power of TB’s internal filter.
- The final score for this message, 84, resulted in it being marked as junk for me. I run a cutoff of 75, while the default cutoff is 90.
- Many of the x-spam-status tokens are actually working against us here. For example, x-spam-status:required=5.0 has a score of 23, which moves the message toward “good”. Yet that token appears on every single email I receive, so it should be neutral. What is going on here?
This is the issue that I reported in a previous post, “bad effects on junk training corpus from change”. That is, recently I have been automatically training more messages as Good than as Junk. For some reason, “x-spam-status:required=5.0” must not appear in my older emails, so this bias shows up because 1) the token changed at some point in time, and 2) there is a mismatch between the number of junk and good emails trained recently. I really need to figure out a solution to this. What I will probably do is add features to FiltaQuilla that let me precisely balance the junk and good training counts to remove this bias.
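The bias can be seen with a back-of-the-envelope calculation. This uses a simplified per-token spam probability, not TB’s actual chi-squared combining, but the imbalance effect is the same: a token that appears in every trained message should come out neutral, and does only when the good and junk training counts are balanced over the period when the token exists.

```python
# Simplified per-token probability P(junk | token) from raw training
# frequencies. TB's real filter combines tokens differently, but the
# effect of unbalanced training is the same.

def token_spam_prob(token_in_junk, junk_total, token_in_good, good_total):
    p_junk = token_in_junk / junk_total   # frequency among trained junk
    p_good = token_in_good / good_total   # frequency among trained good
    return p_junk / (p_junk + p_good)

# Balanced training, token present in every message: neutral, as expected.
print(token_spam_prob(100, 100, 100, 100))   # 0.5

# Token only exists in recent messages, and recent training was mostly
# good: the token now looks like evidence of "good" even though it
# carries no real signal.
print(token_spam_prob(20, 100, 90, 100))
```

The second case is exactly the required=5.0 situation: the token shows up as strongly “good” purely because of when and how the training happened.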
I still need to run tests that directly compare the performance of the spam filters with and without this tokenization, to be sure it is really a good idea. Unfortunately, that is quite tricky to do.
In spite of these issues, for new installations I would recommend using this alternate tokenization whenever you run a SpamAssassin front end to Thunderbird or SeaMonkey.