Character encoding

This morning, someone asked an interesting question.

Last time I worked with the actual HTML design of emails (a long time ago), <head> was not really needed. Is this still true for the most part? Any reason why you still want to include <head> + meta, title tags in emails nowadays?

There are several bits of information at the top of an HTML document that can affect how it renders – there’s the doctype (which strictly sits just before the <head>), which controls the HTML rendering model, there’s often some CSS, which controls the styling, and there’s often a meta tag that states which character set the document uses.
That last one is interesting in the case of a piece of HTML that’s being sent as part of a MIME email – as MIME already has a perfectly good way of specifying the character set a message has, as part of the Content-Type header. I looked at a few bulk messages I’d received recently and, sure enough, most of them include the <head> section, and have a meta tag in there that defines the character set. All of them have a character set defined in the Content-Type header. Sometimes those character sets didn’t match:

Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 7Bit
<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="title" content="New CS5.5 Web Premium" />

(A snippet from this morning’s email.)

What happens when they don’t match? I don’t think it’s defined anywhere. Time for some empirical testing.
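Before testing what clients do with a mismatch, it helps to be able to spot one. Here’s a minimal sketch that pulls both declarations out of a single-part text/html message and compares them. The function name, the regular expression and the sample message (modelled on the snippet above, with a blank line added between headers and body) are illustrative assumptions, not a robust HTML parser.

```python
import re
from email import message_from_string

# Sample modelled on the snippet above; the closing tags are added for completeness.
RAW = """\
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 7bit

<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
</head>
<body></body>
</html>
"""

# Loose pattern for a charset declared inside a meta tag (an assumption, not a full parser).
META_CHARSET = re.compile(r'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def declared_charsets(raw):
    """Return (MIME charset, HTML meta charset) for a single-part message."""
    msg = message_from_string(raw)
    mime = msg.get_content_charset()  # lowercased charset parameter, or None
    match = META_CHARSET.search(msg.get_payload())
    meta = match.group(1).lower() if match else None
    return mime, meta


mime_charset, meta_charset = declared_charsets(RAW)
print(mime_charset, meta_charset, "match" if mime_charset == meta_charset else "MISMATCH")
```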
Testing! For Science!
I needed to create some test emails which would be visibly different depending on which character set the mail client decided to use. I picked out two character sets – ISO-8859-15 and ISO-8859-16, as they differ from each other and from ISO-8859-1 enough that I could differentiate them just by the way two characters were rendered.
The byte 0xfd renders as e-with-a-tail (ę) in ISO-8859-16 and as y-acute (ý) in the other two character sets, while 0xa4 renders as the generic currency symbol (¤) in ISO-8859-1 and as a euro symbol (€) in the other two. I included the characters in two different ways in each test message – once as a raw character in the body of the message (=a4 or =fd in quoted-printable format), and once as a numeric HTML entity (&#164; or &#253;).
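The differences between the three character sets are easy to check with Python’s built-in codecs. This is just a sketch to illustrate the two test bytes and the two numeric entities; the helper name `render` is made up for the example.

```python
import html

CHARSETS = ("iso8859_1", "iso8859_15", "iso8859_16")


def render(byte_value):
    """Decode one raw byte under each of the three character sets."""
    return {cs: bytes([byte_value]).decode(cs) for cs in CHARSETS}


for value in (0xA4, 0xFD):
    print(hex(value), render(value))

# Numeric entities name a code point directly, so no character set is involved.
entity_chars = [html.unescape("&#164;"), html.unescape("&#253;")]
print(entity_chars)
```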
This is what I found:

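Mail client   MIME charset   HTML meta charset   Raw character   HTML entity
Mail.app      ISO-8859-15    ISO-8859-16         ISO-8859-15     ISO-8859-1
Gmail         ISO-8859-15    ISO-8859-16         ISO-8859-15     ISO-8859-1
Mail.app      ISO-8859-16    ISO-8859-15         ISO-8859-16     ISO-8859-1
Gmail         ISO-8859-16    ISO-8859-15         ISO-8859-1      ISO-8859-1
Mail.app      ISO-8859-15    none                ISO-8859-15     ISO-8859-1
Gmail         ISO-8859-15    none                ISO-8859-1      ISO-8859-1
Mail.app      none           ISO-8859-16         broke           ISO-8859-1
Gmail         none           ISO-8859-16         ISO-8859-1      ISO-8859-1
Mail.app      us-ascii       ISO-8859-16         broke           ISO-8859-1
Gmail         us-ascii       ISO-8859-16         ISO-8859-1      ISO-8859-1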

 
There are several things to see from this data. The simple one first – regardless of which character set I declared, and where I declared it, both mail clients rendered characters written as HTML numeric entities (“&#164;”) consistently as their ISO-8859-1 characters. (This isn’t really a surprise, as it’s how the HTML specs define them: a numeric entity names a Unicode code point regardless of the document’s character set, and the first 256 code points match ISO-8859-1.)
Raw characters were much less consistent. Mail.app consistently used the character set declared in the MIME Content-Type header when it was set to something reasonable, and ignored the encoding in the HTML meta tag. Giving it an unreasonable character set in the Content-Type header caused it to render 0xfd as a double dagger (‡), which makes no sense at all in any character set I can find. Gmail managed to render the raw character in ISO-8859-15 correctly, but gave up and fell back to using ISO-8859-1 for everything else.
Conclusions
There are a few things we can conclude from this, I think, even though it really needs some comparisons with different mail clients, and some testing with other character sets (including Unicode and some of the Asian sets).

  1. Don’t bother putting HTML meta content-type tags in your HTML
  2. Send your text/html parts as plain 7-bit ASCII, using HTML entities for non-ASCII characters
  3. It might be less confusing to use named entities such as &copy; rather than numeric ones such as &#169;
  4. If you’re generating numeric entities from user-generated input, be wary of input that’s not ISO-8859-1 or Windows-1252
  5. Character set conversion is hard, let’s go Unicode

I’ve made the test emails I used available for download. From a Unix prompt, with swaks installed, you can send them like this:
for i in charset*.eml ; do swaks --to your@email.address --from your@email.address --server your.email.server --data - < "$i"; done

Related Posts

Real. Or. Phish?

After Epsilon lost a bunch of customer lists last week, I’ve been keeping an eye open to see if any of the vendors I work with had any of my email addresses stolen – not least because it’ll be interesting to see where this data ends up.
Yesterday I got mail from Marriott, telling me that “unauthorized third party gained access to a number of Epsilon’s accounts including Marriott’s email list.” Great! Let’s start looking for spam to my Marriott tagged address, or for phishing targeted at Marriott customers.
I hit what looks like paydirt this morning. Plausible looking mail with Marriott branding, nothing specific to me other than name and (tagged) email address.
It’s time to play Real. Or. Phish?
1. Branding and spelling are all good. It’s using decent stock photos, and what looks like a real Marriott logo.
All very easy to fake, but if it’s a phish it’s pretty well done. Then again, phishes often steal real content and just change out the links.
Conclusion? Real. Maybe.
2. The mail wasn’t sent from marriott.com, or any domain related to it. Instead, it came from “Marriott@marriott-email.com”.
This is classic phish behaviour – using a lookalike domain such as “paypal-billing.com” or “aolsecurity.com” so as to look as though you’re associated with a company, yet to be able to use a domain name you have full control of, so as to be able to host websites, receive email, sign with DKIM, all that sort of thing.
Conclusion? Phish.
3. SPF pass
Given that the mail was sent “from” marriott-email.com, and not from marriott.com, this is pretty meaningless. But it did pass an SPF check.
Conclusion? Neutral.
4. DKIM fail
Authentication-Results: m.wordtothewise.com; dkim=fail (verification failed; insecure key) header.i=@marriott-email.com;
As the mail was sent “from” marriott-email.com it should have been possible for the owner of that domain (presumably the phisher) to sign it with DKIM. That the signature doesn’t verify isn’t a good sign at all.
Conclusion? Phish.
5. Badly obfuscated headers
From: =?iso-8859-1?B?TWFycmlvdHQgUmV3YXJkcw==?= <Marriott@marriott-email.com>
Subject: =?iso-8859-1?B?WW91ciBBY2NvdW50IJYgVXAgdG8gJDEwMCBjb3Vwb24=?=

Base64 encoding of headers is an old spammer trick used to make them more difficult for naive spam filters to handle. That doesn’t work well with more modern spam filters, but spammers and phishers still tend to do it so as to make it harder for abuse desks to read the content of phishes forwarded to them with complaints. There’s no legitimate reason to encode plain ASCII fields in this way. SpamAssassin didn’t like the message because of this.
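If you’re on an abuse desk and want to read headers like these, Python’s standard library will decode them. This is a minimal sketch; the helper name `decode_words` and its `override` parameter are made up for the example. Decoding shows that the From name really is plain ASCII. The Subject, on the other hand, contains one byte, 0x96, which is a control character in the declared iso-8859-1 but an en dash in Windows-1252, so the sketch assumes that is what the sender meant.

```python
from email.header import decode_header


def decode_words(raw, override=None):
    """Decode RFC 2047 encoded-words; `override` replaces the declared charset."""
    out = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(override or charset or "ascii", errors="replace")
        out.append(chunk)
    return "".join(out)


sender = decode_words("=?iso-8859-1?B?TWFycmlvdHQgUmV3YXJkcw==?=")
# Assumption: the 0x96 byte is a Windows-1252 en dash mislabelled as iso-8859-1.
subject = decode_words(
    "=?iso-8859-1?B?WW91ciBBY2NvdW50IJYgVXAgdG8gJDEwMCBjb3Vwb24=?=",
    override="cp1252",
)
print(sender)
print(subject)
```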
Conclusion? Phish.
6. Well-crafted multipart/alternative mail, with valid, well-encoded (quoted-printable) plain text and html parts
Just like the branding and spelling, this is very well done for a phish. But again, it’s commonly something that’s stolen from legitimate email and modified slightly.
Conclusion? Real, probably.
7. Typical content links in the email
Most of the content links in the email are to things like “http://marriott-email.com/16433acf1layfousiaey2oniaaaaaalfqkc4qmz76deyaaaaa”, which is consistent with the from address, at least. This isn’t the sort of URL a real company website tends to use, but it’s not that unusual for click tracking software to do something like this.
Conclusion? Neutral
8. Atypical content links in the email
We also have other links:

Read More

Abuse Reporting Format

J.D. has a great post digging into ARF, the abuse reporting format used by most feedback loops.
If you’re interested in following along, you might find this annotated example ARF report handy.


Clicktracking 2: Electric Boogaloo

A week or so back I talked about clicktracking links, and how to put them together to avoid abuse and blocking issues.
Since then I’ve come across another issue with click tracking links that’s not terribly obvious, and that you’re not that likely to come across, but which could be very painful if you do get hit by it – phishing and malware filters in web browsers.
Visiting this site may harm your computer
First, some background about how a lot of malware is distributed, what’s known as “drive-by malware”. This is where the hostile code infects the victim’s machine without them taking any action to download and run it; rather, they just visit a hostile website and that website silently infects their computer.
The malware authors get people to visit the hostile website in quite a few different ways – email spam, blog comment spam, web forum spam, banner ads purchased on legitimate websites and compromised legitimate websites, amongst others.
That last one, compromised legitimate websites, is the type we’re interested in. The sites compromised aren’t usually a single, high-profile website. Rather, they tend to be a whole bunch of websites that are running some vulnerable web application – if there’s a security flaw in, for example, WordPress blog software then a malware author can compromise thousands of little blog sites, and embed malware code in each of them. Anyone visiting any of those sites risks being infected, and becoming part of a botnet.
Because the vulnerable websites are all compromised mechanically in the same way, the URLs of the infected pages tend to look much the same, just with different hostnames – http://example.com/foo/bar/baz.html, http://www.somewhereelse.invalid/foo/bar/baz.html and http://a.net/foo/bar/baz.html – and they serve up just the same malware (or, just as often, redirect the user to a site in Russia or China that serves up the malware that infects their machine).
A malware filter operator might receive a report about http://example.com/foo/bar/baz.html and decide that it was infected with malware, adding example.com to a blacklist. A smart filter operator might decide that this might be just one example of a widespread compromise, and go looking for the same malware elsewhere. If it goes to http://a.net/foo/bar/baz.html and finds the exact same content, it’ll know that that’s another instance of the infection, and add a.net to the blacklist.
What does this have to do with clickthrough links?
Well, an obvious way to implement clickthrough links is to use a custom hostname for each customer (“click.customer.com”), and have all those pointing at a single clickthrough webserver. It’s tedious to set up the webserver to respond to each hostname as you add a new customer, though, so you decide to have the webserver ignore the hostname. That’ll work fine – if you have customer1 using a clickthrough link like http://click.customer1.com/123/456/789.html you’d have the webserver ignore “click.customer1.com” and just read the information it needs from “123/456/789.html” and send the redirect.
But that means that if you also have customer2, using the hostname click.customer2.com, then the URL http://click.customer2.com/123/456/789.html will redirect to customer1’s content.
If a malware filter decides that http://click.customer1.com/123/456/789.html redirects to a phishing site or a malware download – either due to a false report, or due to the customer’s page actually being infected – then they’ll add click.customer1.com to their blacklist, meaning no http://click.customer1.com/ URLs will work. So far, this isn’t a big problem.
But if they then go and check http://click.customer2.com/123/456/789.html and find the same redirect, they’ll blacklist click.customer2.com, and so on for all the clickthrough hostnames of yours they know about. That’ll cause any click on any URL in any email a lot of your customers send out to go to a “This site may harm your computer!” warning – which will end up a nightmare even if you spot the problem and get the filter operators to remove all those hostnames from the blacklist within a few hours or a day.
Don’t let this happen to you. Make sure your clickthrough webserver pays attention to the hostname as well as the path of the URL.
Use different hostnames for different customers’ clickthrough links. And if you pick a link from mail sent by Customer A, and change the hostname of that link to the clickthrough hostname of Customer B, then that link should fail with an error rather than displaying Customer A’s content.
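One way to get that behaviour is to key the link lookup on the hostname as well as the path. Here’s a minimal sketch of the idea; the link table, the destination URL and the `resolve` function are all hypothetical, and a real clickthrough server would do this inside its request handler and return a 404 where this returns None.

```python
from urllib.parse import urlsplit

# Hypothetical link table: each entry belongs to exactly one clickthrough hostname.
LINKS = {
    ("click.customer1.com", "/123/456/789.html"): "http://www.customer1.com/landing",
}


def resolve(url):
    """Return the redirect target, or None if host and path don't belong together."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return LINKS.get((host, parts.path))


print(resolve("http://click.customer1.com/123/456/789.html"))  # customer1's own link
print(resolve("http://click.customer2.com/123/456/789.html"))  # same path, wrong host
```

With this in place, a filter that swaps in another customer’s hostname sees an error instead of the same redirect, so one blacklisting doesn’t spread to every hostname you serve.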
