Character encoding
- steve
- Best practices
- May 3, 2011
This morning, someone asked an interesting question.
Last time I worked with the actual HTML design of emails (a long time ago), <head> was not really needed. Is this still true for the most part? Any reason why you still want to include <head> + meta, title tags in emails nowadays?
Several bits of information at the top of an HTML document can affect how it renders – there’s the doctype (which strictly sits just before the <head>), which controls the HTML rendering mode, there’s often some CSS in the <head> which controls the styling, and there’s often a meta tag that states which character set the document uses.
That last one is interesting in the case of a piece of HTML that’s being sent as part of a MIME email – as MIME already has a perfectly good way of specifying the character set a message has, as part of the Content-Type header. I looked at a few bulk messages I’d received recently and, sure enough, most of them include the <head> section, and have a meta tag in there that defines the character set. All of them have a character set defined in the Content-Type header. Sometimes those character sets didn’t match:
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 7Bit
<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="title" content="New CS5.5 Web Premium" />
(A snippet from this morning’s email)
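A mismatch like this can be spotted mechanically. The following is a minimal sketch, assuming a single-part text/html message; the sample message, the regular expression and the function name `declared_charsets` are illustrative, not taken from any particular tool.

```python
import re
from email import message_from_string

# A cut-down message modelled on the snippet above (illustrative only).
RAW = '''Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: 7bit

<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
</head>
</html>
'''

# Loose pattern for a charset declared inside a <meta> tag.
META_CHARSET = re.compile(r'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def declared_charsets(raw_message):
    """Return (MIME charset, HTML meta charset); either may be None."""
    msg = message_from_string(raw_message)
    mime_charset = msg.get_content_charset()  # lowercased by the email module
    match = META_CHARSET.search(msg.get_payload())
    meta_charset = match.group(1).lower() if match else None
    return mime_charset, meta_charset

mime_cs, meta_cs = declared_charsets(RAW)
if mime_cs != meta_cs:
    print(f"charset mismatch: MIME says {mime_cs}, meta tag says {meta_cs}")
```

A real checker would also need to walk multipart messages and undo any transfer encoding before searching the body.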
What happens when they don’t match? I don’t think it’s defined anywhere. Time for some empirical testing.
Testing! For Science!
I needed to create some test emails which would be visibly different depending on which character set the mail client decided to use. I picked out two character sets – ISO-8859-15 and ISO-8859-16, as they differ from each other and from ISO-8859-1 enough that I could differentiate them just by the way two characters were rendered.
The byte 0xfd renders as e-with-a-tail (ę) in ISO-8859-16 and as y-acute (ý) in the other two character sets, while 0xa4 renders as the generic currency symbol (¤) in ISO-8859-1 and as a euro symbol (€) in the other two. I included the characters in two different ways in each test message – once as a raw character in the body of the message (=a4 or =fd in quoted-printable format), and once as a numeric HTML entity (&#164; or &#253;).
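The differences between the three character sets can be checked directly. This is a small sketch using Python’s built-in codecs; the helper name `render` is made up for illustration.

```python
CHARSETS = ["iso-8859-1", "iso-8859-15", "iso-8859-16"]

def render(byte_value, charset):
    """Decode a single byte the way a client honouring `charset` would."""
    return bytes([byte_value]).decode(charset)

for byte_value in (0xA4, 0xFD):
    for charset in CHARSETS:
        print(f"0x{byte_value:02x} in {charset}: {render(byte_value, charset)}")
```

Between them, the two bytes are enough to tell which of the three character sets a client applied.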
This is what I found (character sets are abbreviated, so -15 means ISO-8859-15):
Mail client | MIME charset | HTML meta charset | Raw character | HTML entity |
---|---|---|---|---|
Mail.app | -15 | -16 | -15 | -1 |
Gmail | -15 | -16 | -15 | -1 |
Mail.app | -16 | -15 | -16 | -1 |
Gmail | -16 | -15 | -1 | -1 |
Mail.app | -15 | none | -15 | -1 |
Gmail | -15 | none | -1 | -1 |
Mail.app | none | -16 | broke | -1 |
Gmail | none | -16 | -1 | -1 |
Mail.app | us-ascii | -16 | broke | -1 |
Gmail | us-ascii | -16 | -1 | -1 |
There are several things to see from this data. The simple one first – regardless of which character set I declared, and where I declared it, both mail clients rendered characters written as HTML numeric entities (“&#164;”) consistently as their ISO-8859-1 characters. (This isn’t really a surprise, as it’s how the HTML specs define them: a numeric entity names a Unicode code point, and the first 256 code points match ISO-8859-1.)
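The stability of numeric entities is easy to reproduce outside a mail client. The sketch below uses Python’s `html.unescape` as a stand-in for an HTML renderer, which is an assumption about how a client behaves rather than a description of Mail.app or Gmail.

```python
import html

# The entities are pure ASCII, so every charset decodes them to the same text.
entity_bytes = b"&#164; &#253;"

results = {
    charset: html.unescape(entity_bytes.decode(charset))
    for charset in ("us-ascii", "iso-8859-1", "iso-8859-15", "iso-8859-16")
}

for charset, text in results.items():
    print(f"{charset}: {text}")
```

Every charset yields the same two characters, because the entity is resolved after decoding and refers to a code point rather than a byte.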
Raw characters were much less consistent. Mail.app consistently used the character set declared in the MIME Content-Type header when it was set to something reasonable, and ignored the encoding in the HTML meta tag. Giving it an unreasonable character set in the Content-Type header caused it to render 0xfd as a double dagger (‡), which makes no sense at all in any character set I can find. Gmail managed to render the raw character in ISO-8859-15 correctly, but gave up and fell back to using ISO-8859-1 for everything else.
Conclusions
There are a few things we can conclude from this, I think, even though it really needs some comparisons with different mail clients, and some testing with other character sets (including Unicode and some of the Asian sets).
- Don’t bother with putting HTML meta content-type tags in your HTML
- Send your text/html parts as plain 7-bit ASCII, using HTML entities for non-ASCII characters
- It might be less confusing to use named entities such as &copy; rather than numeric ones such as &#169;
- If you’re generating numeric entities from user-generated input, be wary of input that’s not ISO-8859-1 or Windows-1252
- Character set conversion is hard, let’s go Unicode
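The entity-based approach in the list above can be sketched in a few lines. This is a minimal illustration, assuming the input is already valid HTML text (so markup characters such as < and & are left alone); the function name `to_ascii_html` is hypothetical.

```python
from html.entities import codepoint2name

def to_ascii_html(text, named=True):
    """Replace every non-ASCII character with a named or numeric entity."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                       # plain ASCII passes through
        elif named and cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # e.g. &copy;
        else:
            out.append(f"&#{cp};")               # e.g. &#169;
    return "".join(out)

print(to_ascii_html("\u00a9 2011"))               # named form
print(to_ascii_html("\u00a9 2011", named=False))  # numeric form

# Why the source charset matters: the same input byte gives different entities.
raw = b"\xa4"
print(to_ascii_html(raw.decode("iso-8859-1"), named=False))   # currency sign
print(to_ascii_html(raw.decode("iso-8859-15"), named=False))  # euro sign
```

The last two lines show the risk with user-generated input: decode the bytes with the wrong character set and the entity that gets emitted is valid but names the wrong character.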
I’ve made the test emails I used available for download. From a unix prompt, with swaks installed, you can send them like this:
for i in charset*.eml ; do swaks --to your@email.address --from your@email.address --server your.email.server --data - <"$i"; done
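For anyone who would rather generate a similar test message than download one, here is a rough sketch. The layout, subject line and function name `build_test_message` are assumptions for illustration and will not match the original test files exactly; headers such as From, To and Date are left out.

```python
import quopri

def build_test_message(mime_charset, meta_charset):
    """Build a text/html message whose MIME and meta charsets may disagree."""
    meta = (
        '<meta http-equiv="Content-Type" '
        f'content="text/html; charset={meta_charset}">'
    )
    # Encoding as latin-1 maps U+00A4 and U+00FD straight to bytes 0xa4 and 0xfd.
    html_bytes = (
        "<html>\n<head>\n" + meta + "\n</head>\n<body>\n"
        "<p>raw: \xa4 \xfd</p>\n"
        "<p>entity: &#164; &#253;</p>\n"
        "</body>\n</html>\n"
    ).encode("latin-1")
    # Quoted-printable keeps the message 7-bit clean on the wire.
    body = quopri.encodestring(html_bytes).decode("ascii")
    headers = (
        f"Subject: charset test (MIME {mime_charset}, meta {meta_charset})\n"
        "MIME-Version: 1.0\n"
        f"Content-Type: text/html; charset={mime_charset}\n"
        "Content-Transfer-Encoding: quoted-printable\n"
    )
    return headers + "\n" + body

print(build_test_message("iso-8859-15", "iso-8859-16"))
```

Saving the output to a file such as charset-15-16.eml would let the loop above send it.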