… or Why do spam filters sometimes have some very strange ideas?
It’s been dogma for a long time that if you’re doing email marketing you should avoid using a .biz domain in your mails. Even if your main website was in .biz, you should use something different in your messages, perhaps a website you buy solely for use in email that redirects to your real .biz website. Last year I looked at why that was, and what could be done about it.
One main reason for avoiding it has been resolved, so if you’ve been avoiding .biz URLs in your mail, now might be a good time to re-test that decision. And enough time has gone by that I can share, without upsetting everyone, the ugly reasons why .biz was treated for so long, and without good reason, as a sure sign of spam.
The simple reason was SpamAssassin. SpamAssassin is very widely used to filter mail, both in its open source version and buried anonymously deep inside countless commercial spam filters and filtering appliances. Not only that, but SpamAssassin is readily available, so most people looking to do pre-mailing content checks or looking at why content-based filters are objecting to a particular email will use SpamAssassin as their model. It’s very widely deployed, and influential far beyond the size of its deployed base.
SpamAssassin is a score-based spam filter – it checks an email against hundreds of rules, adds up the scores of the rules that match and, in typical setups, decides the mail is spam if the total score is five or more. Pretty reasonable, but here are a few of the rules and scores (from the 2006 version of SpamAssassin):
- 1.392 Advance Fee Fraud (Nigerian 419)
- 0.493 Refers to an erectile drug
- 1.995 Subject contains G a p p y T e x t
- 0.496 Message is 40-50% HTML
- 2.100 From: domain has a series of 7 consonants
- 1.635 Possible porn – Hardcore Porn
- 2.013 Contains a URL in the BIZ top level domain
- 1.273 Contains a URL in the INFO top level domain
You can’t quite treat the scores as SpamAssassin’s measure of the “spamminess” of a message (“a .biz URL is 23% spammier than hardcore porn” … “The URL microsoft.biz is about as spammy as From: Ignatious T. Aardvark <firstname.lastname@example.org>”), but it’s pretty clear that using a .biz domain in your mail had a huge effect on your SpamAssassin score, and was a bad risk to take if you could easily avoid it.
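To make the arithmetic concrete, here is a minimal sketch of score-based filtering using the eight scores quoted above and the typical threshold of five. The short rule names and the two example messages are made up for illustration; they are not SpamAssassin’s own rule identifiers, and this is not how SpamAssassin itself is implemented.

```python
# Scores are the 2006-era values quoted in the list above.
# The dictionary keys are invented labels for this sketch,
# not SpamAssassin's real rule names.
RULE_SCORES = {
    "advance_fee_fraud": 1.392,
    "erectile_drug": 0.493,
    "gappy_subject": 1.995,
    "html_40_50_percent": 0.496,
    "from_7_consonants": 2.100,
    "hardcore_porn": 1.635,
    "url_in_biz_tld": 2.013,
    "url_in_info_tld": 1.273,
}

SPAM_THRESHOLD = 5.0  # "five or more", as in typical setups


def total_score(matched_rules):
    """Add up the scores of every rule that matched the message."""
    return sum(RULE_SCORES[name] for name in matched_rules)


def is_spam(matched_rules, threshold=SPAM_THRESHOLD):
    """A message is treated as spam once its total reaches the threshold."""
    return total_score(matched_rules) >= threshold


# Hypothetical message 1: a half-HTML newsletter linking to a .biz site.
newsletter = ["html_40_50_percent", "url_in_biz_tld"]

# Hypothetical message 2: the same .biz link plus two other quirks.
unlucky = ["url_in_biz_tld", "from_7_consonants", "gappy_subject"]

print(round(total_score(newsletter), 3), is_spam(newsletter))
print(round(total_score(unlucky), 3), is_spam(unlucky))

# Where the "23% spammier than hardcore porn" quip comes from.
biz_vs_porn = RULE_SCORES["url_in_biz_tld"] / RULE_SCORES["hardcore_porn"]
print(f"{(biz_vs_porn - 1) * 100:.0f}% higher")

# Share of the spam threshold used up by the .biz rule alone.
biz_share = RULE_SCORES["url_in_biz_tld"] / SPAM_THRESHOLD
print(f"{biz_share:.0%} of the threshold")
```

On its own the .biz rule doesn’t push a message over the line, but it uses up roughly 40% of the budget before any other rule has fired, which is why it was such an easy risk to avoid.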
So, was .biz really that spam-ridden? I recall it being pretty bad when it first launched, so it’s reasonable that SpamAssassin has that rule, but was it still bad by 2006? Bad enough to merit a score quite that high? That’s hard to measure, but a reasonable metric is the percentage of domains in each top level domain (.com, .net, .biz, etc.) that had been spotted as a definite spam sign by the folks at SURBL.
So .biz looks just fine – comparable with .com or .net, and certainly a lot better than .info. Why was SpamAssassin still treating it as so spammy?
SpamAssassin developers measure and develop their scores based on several corpuses of recently received email, hand categorised into spam mail and non-spam (“ham”) mail. Like many other spam filters, they stay fairly vague about where exactly these corpuses come from (to avoid people gaming the system) but they seem to be based mostly on the personal mailboxes of developers. Of the five corpuses SpamAssassin were using in 2006, four saw almost no .biz spam, but one saw quite a lot (graph of .biz URLs in spam). More importantly, though, none of them saw more than a tiny number of .biz URLs in non-spam (graph of .biz URLs in non-spam).
The algorithm that SpamAssassin uses to assign scores to the rules is complex, but loosely speaking if a rule helps to correctly classify one of the mails in the spam corpus as spam, then the score of that rule will tend to be increased, while if a rule helps to wrongly classify non-spam as spam then the score for that rule will tend to be decreased. In the test corpuses used, .biz URLs hardly ever appear in non-spam, so there’s no pressure to reduce the score assigned to that rule.
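The “loosely speaking” description above can be sketched as a simple error-driven update, in the style of a perceptron. This is a toy stand-in under stated assumptions, not SpamAssassin’s actual optimizer: the function name, the step size, the two rule names and the tiny corpora are all invented for illustration.

```python
def train_rule_scores(corpus, rule_names, threshold=5.0, step=0.5, epochs=50):
    """Toy error-driven scoring, NOT SpamAssassin's real algorithm.

    corpus is a list of (matched_rules, is_spam) pairs.
    - Spam that slips under the threshold nudges its rules' scores up.
    - Ham that is wrongly flagged nudges its rules' scores down.
    """
    scores = {name: 0.0 for name in rule_names}
    for _ in range(epochs):
        for matched_rules, is_spam in corpus:
            flagged = sum(scores[r] for r in matched_rules) >= threshold
            if is_spam and not flagged:
                for r in matched_rules:
                    scores[r] += step  # rule would have helped catch spam
            elif not is_spam and flagged:
                for r in matched_rules:
                    scores[r] -= step  # rule contributed to a false positive
    return scores


RULES = ["biz_url", "html_heavy"]  # hypothetical rule names

# Hypothetical corpus in which .biz URLs never appear in ham.
corpus_no_biz_ham = [
    (["biz_url"], True),
    (["biz_url", "html_heavy"], True),
    (["html_heavy"], False),
    (["html_heavy"], False),
]

# Same corpus plus a single legitimate mail containing a .biz URL.
corpus_with_biz_ham = corpus_no_biz_ham + [(["biz_url"], False)]

scores_no_ham = train_rule_scores(corpus_no_biz_ham, RULES)
scores_with_ham = train_rule_scores(corpus_with_biz_ham, RULES)

print(scores_no_ham)
print(scores_with_ham)
```

In this toy setup, with no .biz ham in the corpus, nothing ever pushes the `biz_url` score down, so it climbs until a .biz URL alone is enough to reach the threshold. Adding even one legitimate .biz mail pulls the score back under it. Real corpora and the real optimizer are far messier, but the asymmetry is the same.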
So the final answer to the question in the title is:
1. Long, long ago when .biz was new it was used by a lot of spammers (because it was new, so a lot of good domains were easily available).
2. SpamAssassin added a rule to recognize .biz URLs and increase the spam score of mails containing them.
3. SpamAssassin is very influential, even more so than its wide deployment would suggest.
4. Legitimate mailers saw that SpamAssassin would punish them for using a .biz URL, so they pretty much all avoided using .biz URLs in their email.
5. With effectively no legitimate bulk mail using .biz URLs, there’s nothing to keep the SpamAssassin score for the “contains a .biz URL” rule from creeping up and becoming even more punitive towards use of .biz URLs.
6. Go to step 4.
This leads to a vicious circle where legitimate mailers don’t use .biz as SpamAssassin would punish them for doing so, and SpamAssassin continues to punish anyone using .biz URLs because they’re not used by legitimate mailers. SpamAssassin eventually broke this particular circle by removing the rule from their latest release, but not until it had had a major effect on use of .biz URLs that still persists.
The .biz issue has since been resolved, but there’s a broader deliverability conclusion to draw from this story. While on a branding and image level you want your messages to stand out from all your competitors’ messages, on a technical level you want your mails to be similar to those of other legitimate mailers. That way, if there’s an oddity in a content filter that makes it classify your mail as spam it’ll likely be classifying lots of other legitimate mail as spam too, and be fixed fairly quickly (probably before it’s deployed into production).
That includes things like the way you use HTML and MIME, the way you register the domain names you use, the way you use them as URLs in messages, and a bunch of other things. Being aware of the sort of things that content filters like SpamAssassin look at is a good place to start.