Content based filtering

A spam filter looks at many things when it’s deciding whether or not to deliver a message to the recipient’s inbox, usually divided into two broad categories – the behaviour of the sender and the content of the message.
When we talk about sender behaviour we’ll often dive headfirst into the technical details of how that’s monitored and tracked – history of mail from the same IP address, SPF records, good reverse DNS, send rates and ramping, polite SMTP level behaviour, DKIM and domain-based reputation and so on. If all of those are OK and the mail still doesn’t get delivered then you might throw up your hands, fall back on “it’s content-based filtering” and leave it at that.
There’s just as much detail and scope for diagnosis in content-based filtering, though. It’s just a bit more complex, so some delivery folks tend to gloss over it. If you’re sending mail that people want to receive, you’re sure you’re sending the mail technically correctly and you have a decent reputation as a sender then it’s time to look at the content.
You want your mail to look just like wanted mail from reputable, competent senders and to look different to unwanted mail, viruses, phishing emails, botnet spoor and so on. And not just to mechanical spam filters – if a postmaster looks at your email, you want it to look clean, honest and competently put together to them too.
Some of the distinctive content differences between wanted and unwanted email are due to the content as written by the sender, and some are due to senders of unwanted email trying to hide their identity or their content, but many are due to the different quality of the software used to send each sort of mail. Mail clients used by individuals and content composition software used by high quality ESPs tend to be well written and to comply with both the email and MIME RFCs and the unwritten best common practices for email composition. The software used by spammers, botnets, viruses and low quality ESPs tends not to do so well.
Here’s a (partial) list of some of the things to consider:

MIME Structure
Good email tends to be either plain text or a multipart mail consisting of two versions of the same message, one in HTML and one in plain text.
Bad email often doesn’t have the plain text part. Either it’s missing altogether, or it’s completely different (much shorter) content than the HTML part.
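As a rough illustration, here’s how that two-part structure might be built and checked with Python’s standard email package. The addresses, subject and function names are placeholders made up for this sketch; it isn’t how any particular ESP or filter does it.

```python
from email.message import EmailMessage


def build_message(text_body: str, html_body: str) -> EmailMessage:
    """Build a multipart/alternative message carrying both versions."""
    msg = EmailMessage()
    msg["From"] = "newsletter@example.com"   # placeholder address
    msg["To"] = "reader@example.net"         # placeholder address
    msg["Subject"] = "Monthly update"        # placeholder subject
    msg.set_content(text_body)                      # the text/plain part
    msg.add_alternative(html_body, subtype="html")  # adds the text/html part
    return msg


def part_types(msg: EmailMessage) -> list[str]:
    """List the content types of the non-container parts, in order."""
    return [p.get_content_type() for p in msg.walk() if not p.is_multipart()]


msg = build_message(
    "Hello,\n\nThis month's news is below.\n",
    "<html><body><p>Hello,</p><p>This month's news is below.</p></body></html>",
)
print(msg.get_content_type())
print(part_types(msg))
```

A message that lists only text/html here is missing its plain text part, which is the pattern described above. Comparing the length of the two parts would be a natural next check.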
Text Encoding
Bad email often tries to hide its content from spam filters. One common way of doing this is to use base64 encoding for text where quoted-printable encoding would be appropriate.
Lazy software developers sometimes base64 encode everything, as it’s less work than deciding which encoding is appropriate for a message part. Doing that looks dishonest or incompetent to filters and postmasters.
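If you want to check your own mail for this, something like the following sketch could flag text parts that are base64 encoded even though they’re almost entirely ASCII. The function name and the 90% threshold are assumptions picked for illustration, not values taken from any real filter.

```python
from email.message import EmailMessage


def base64_text_parts(msg: EmailMessage, ascii_threshold: float = 0.9) -> list[str]:
    """Return content types of text parts that use base64 for mostly-ASCII text.

    ascii_threshold is an arbitrary illustrative cut-off.
    """
    flagged = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        cte = str(part.get("Content-Transfer-Encoding", "")).lower()
        if cte != "base64":
            continue
        text = part.get_content()  # decoded to str by the email package
        if not text:
            continue
        ascii_share = sum(ch.isascii() for ch in text) / len(text)
        if ascii_share >= ascii_threshold:
            flagged.append(part.get_content_type())
    return flagged


body = "Caf\u00e9 opening hours are below.\n"

good = EmailMessage()
good.set_content(body, cte="quoted-printable")

lazy = EmailMessage()
lazy.set_content(body, cte="base64")

print(base64_text_parts(good))
print(base64_text_parts(lazy))
```

Text in a language that’s mostly non-ASCII may reasonably be base64 encoded, which is why the sketch only flags parts that are nearly all ASCII.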
Images
Another way bad email tries to hide its content is by misuse of images. The most obvious example of this is mail that consists of just a single huge image – sometimes that’s just because it’s easier for the graphic designer to do it that way, but more often it’s a spammer trying to hide their content from filters. Either way, it’s much less likely to be delivered.
Including CAN-SPAM required boilerplate (such as the postal address) purely as an image is another thing that’s distinctive to bad email. Bad email hides the contact address in that way so that people can’t search on it to track the sender’s behaviour across brands and shell companies, and so that they can’t use it to key targeted spam filters on. Good email doesn’t need to do that.
HTML Structure
If your email is completely unreadable with images not displayed, it’s not going to be a good marketing piece in the (common) case that images aren’t shown. Including appropriate ALT text for each image not only makes it look better to recipients when images are turned off, it also makes it look more legitimate to postmasters with ticketing systems that don’t display images, or only show the raw HTML. It sometimes makes spam filters happier too.
That’s just one example of sending “good” HTML.
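The ALT text check is easy to automate before you send. Here’s a minimal sketch using Python’s standard HTML parser; the class and function names are invented for this example, and it treats an empty ALT attribute as missing, which may be too strict for purely decorative images.

```python
from html.parser import HTMLParser


class ImageAltAudit(HTMLParser):
    """Collect the src of every <img> that has no usable alt text."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags also arrive here by default.
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = (attributes.get("alt") or "").strip()
        if not alt:
            self.missing_alt.append(attributes.get("src") or "")


def images_missing_alt(html: str) -> list[str]:
    audit = ImageAltAudit()
    audit.feed(html)
    audit.close()
    return audit.missing_alt


sample = (
    '<p><img src="logo.png" alt="Acme logo">'
    '<img src="hero.jpg">'
    '<img src="spacer.gif" alt=""/></p>'
)
print(images_missing_alt(sample))  # ['hero.jpg', 'spacer.gif']
```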
Phishy URLs
Bad email sent by phishers often includes links that look like <a href="http://phisher.ru/">bank.com</a>, where the message is trying to look like legitimate email from bank.com, but it’s sending readers to phisher.ru instead. <a href="http://bank.com.whatever.phisher.ru/">bank.com</a> is an even more obvious attempt to defraud the recipient.
Otherwise good email sent by naive ESPs often includes links that look like <a href="http://click.esp.com?trackdata=xxxx&target=bank.com/">bank.com</a>. To a spam filter, that looks much the same as a typical phishing URL, and the delivery is not going to go well.
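To make the comparison concrete, here’s a hedged sketch of the kind of check a filter might run: find links whose visible text looks like a hostname, and flag them when the href points somewhere else. The regular expression, class and function names are all assumptions for this example, and real filters are certainly more sophisticated than this.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Link text that looks like a bare hostname or URL, e.g. "bank.com".
HOSTNAME_RE = re.compile(
    r"^(?:https?://)?((?:[a-z0-9-]+\.)+[a-z]{2,})(?:[/:?#].*)?$", re.I
)


class LinkAudit(HTMLParser):
    """Collect (href, visible text) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def mismatched_links(html: str) -> list[tuple[str, str]]:
    """Return (link host, displayed host) pairs that don't match."""
    audit = LinkAudit()
    audit.feed(html)
    audit.close()
    flagged = []
    for href, text in audit.links:
        shown = HOSTNAME_RE.match(text)
        if not shown:
            continue  # link text isn't a hostname, nothing to compare
        shown_host = shown.group(1).lower()
        link_host = (urlsplit(href).hostname or "").lower()
        # Accept the same host or a subdomain of the displayed host.
        if link_host != shown_host and not link_host.endswith("." + shown_host):
            flagged.append((link_host, shown_host))
    return flagged
```

Run against the three example links above, this flags all of them, including the well-meaning click-tracking link. A link to www.bank.com displayed as bank.com passes, which is one reason tracking links on a subdomain of the brand’s own domain tend to look less suspicious.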
Bad Phrasing or Appearance
Even if 100% of your recipients desperately wait for every issue of your newsletter there are some phrases that will cause you more problems than others. “Looking spammy” is one of the worst things for your email if you need to discuss a delivery issue with a postmaster or a filter vendor – if it “looks like spam” they’re much less likely to believe it’s really wanted by recipients.
If your newsletter is about “Moustache Rides” (real example, I’m not making this up) then you might not be able to fix the phrasing, but you should try and make the rest of the newsletter look professionally put together, as much as you can anyway.
URL Reputation
If two emails received “look similar” and the recipient complained about the first one, it’s likely the second one will be unwanted too. But mechanically detecting similar content is complex and expensive to do, so a common trick is to “fingerprint” each email by looking for distinctive features in it, and considering messages that share a fingerprint to be similar.
One of the simpler fingerprints to use is the URLs used in links in the mail, more specifically the hostnames of the links. If someone is sending bad email and you send email using the same URLs or hostnames, it’s likely to be treated poorly.
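A hostname fingerprint of that sort can be sketched in a few lines. The URL pattern, the function names and the idea of a plain set of “bad” hostnames are simplifications assumed for this example; a real filter would draw on reputation data rather than a hard-coded set.

```python
import re
from urllib.parse import urlsplit

# Deliberately loose pattern: anything that starts like an http(s) URL.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.I)


def link_hostnames(body: str) -> set[str]:
    """Return the lower-cased hostnames of every http(s) URL in a message body."""
    hosts = set()
    for url in URL_RE.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, skip it
        if host:
            hosts.add(host.lower())
    return hosts


def shared_hostnames(body: str, bad_hosts: set[str]) -> set[str]:
    """Hostnames in this message that were also seen in complained-about mail."""
    return link_hostnames(body) & bad_hosts


body = (
    'See <a href="http://click.example.com/a?x=1">our offer</a> '
    "or visit https://Shop.Example.org/sale today."
)
print(sorted(link_hostnames(body)))
print(shared_hostnames(body, {"click.example.com", "other.example.net"}))
```

If the second call returns anything, your mail shares a fingerprint with mail that drew complaints, whoever sent it.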
Fiddly Trivia
There are lots of other fiddly little things that spam filters key on too. You shouldn’t obsess about them too much, but it’s worth being aware of the sort of things that can make a difference. SpamAssassin publish some of the rules they use. If you look at the rules, look at the scores too – a rule with a score of 0.001 isn’t very relevant.

1 comment

  • Just because the SpamAssassin rules a message hits have a default score of 0.01 or smaller doesn’t mean the message will be deliverable at every site. The reason for using small scores (rather than hidden rules that don’t score at all) is essentially a tracking function, to look for patterns of hits. Some sites may use hits on the low-scoring rules as part of local meta-rules, where several rules hitting in combination trigger other rules that score much more highly. Thus, there may be local (non-standard) rules that look for hits on a set of rules (say, hits on 4 rules out of a set of 7 or 8), and even if those rules score 0.01 individually (and 0.04 in aggregate), the combination may be enough to trigger a rule that scores something like 4.5 points.

By steve