Which is better UTF-8 or ISO-?

Someone asked today on a mailing list whether they should be using UTF-8 or “ISO” encoding for sending email. What’s the best choice depends on some of the details of the situation, but here’s the answer I gave:
UTF-8 will work for pretty much anything, as it’s just an 8 bit encoding scheme for Unicode (which is supposed to be the one character encoding to rule them all). It’s well supported in most languages and development environments – Windows has been native UTF-16 under the covers since the mid 90s, for instance – and typical messages that use mainstream glyphs should render well from utf-8 in most western MUAs and browsers.
There are still a very few old or broken clients out there that will not handle UTF-8 well but (outside the asian language market, where there’s still some non-ASCII, non-Unicode legacy usage) they’re typically ones that don’t really handle any character set encoding well and the only thing safe to send to them is either plain ASCII or whichever ASCII superset their OS happens to support natively (which is probably an argument for sending Windows-1252 codepage, but not a terribly strong one).
The various extended ASCIIs (such as ISO-8859-*) will only work for messages that are written solely using characters from that character set. If you have even one character in a message that cannot be expressed in ISO-8859-1, then you can’t use ISO-8859-1 to send that message.
ISO-8859-1 (aka Latin1) is fairly sloppy in some respects – it has no apostrophe, nor single quotes, for instance – but it can handle an awful lot of languages, from Kurdish to Swahili. It can’t handle Dutch, Estonian, Finnish, Hungarian and Welsh particularly well, nor can it show the Euro symbol (ISO-8859-14 or -15 are needed for some characters there).
A common problem is that many people (and the software they write) think that Windows uses Latin1. It doesn’t, it uses Windows-1252. If you accept messages written on Windows, using the Windows-1252 code page, and throw them out on the wire as ISO-8859-1 what you end up with is not quite right. It mostly works, as the two codepages overlap quite a bit, but they have different glyphs in the 0x80-0x9f range. So if you use single or double quotes (“smart quotes”), or the Euro symbol, or ellipses, or bullet, or the trademark symbol in your message they’ll be garbled. This is so common that some mail clients and web browsers will actually treat a document that claims to be ISO-8859-1 as Windows-1252, but that’s a bug workaround and not something it’s really safe to rely on.
If you’re doing personalized messages, and you’re sending one of them to Győző and one of them to Eiður then you may have to use different character sets for the two messages. If you’re talking about Győző and personalizing it for Eiður then you might find things break horribly.
Someone probably has some concrete data on mail client character set support, broken down by region and language, but my understanding is that this is a reasonable approach:

  1. If you’re sending any entirely non-ASCII messages (asian languages, primarily) then it’s way more complex than it should be, and you need to find someone who really understands the deployment of BIG5 and ShiftJIS and so on if you want to dial in to perfection. If you don’t have someone like that, UTF-8 is your best bet.
  2. If your tool chain supports non-ASCII messages, and you want to choose a single encoding, go with UTF-8. If the UTF-8 you end up sending is entirely, or almost entirely, ASCII then this will render well even on the tiny fraction of mail clients that don’t support character sets. (This is what I’d do).
  3. If your tool chain supports non-ASCII, but almost all your messages are plain ASCII and you want to be clever, encode each message separately. That’ll mean storing the templates and mailmerge data as UTF-8 probably, then converting the resulting message to ASCII if possible. That’s actually a lot simpler to do than it sounds, as a valid UTF-8 message that can be converted to ASCII is already an ASCII message, so all you need to do is set the charset to UTF-8 if there’s any byte bigger than 127 in the message or to us-ascii otherwise.
  4. If your toolchain was written to support only truly 8 bit character sets, or if that’s all you ever expect to send, then ISO-8859-15 is about as good as you can do.

If you’re accepting messages via a web front end, or any other path that could lead to stuff being copy and pasted or uploaded from Windows, look out for the Windows-1252 issues, and convert the provided message to whatever format you’re storing templates as (probably utf-8).
Then, once you have the message, be careful to send it over the wire as 7 bit ASCII, probably using quoted-printable encoding, to make sure it isn’t thrown away or transcoded to base64 en-route.

Related Posts

Update on Yahoo and the PBL

Last week I requested details about Yahoo rejections for IPs pointing to the PBL when the IP was not on the PBL. A blog reader did provide me with extremely useful logs documenting the problem. Thank you!
Based on my examination of the logs, this appears to be a problem only on some of the Yahoo! MXs. In fact, in the logs I was sent, the email was rejected from 2 machines and then eventually accepted by a third.
I have forwarded those logs onto Yahoo who are looking into the issue. I have also talked with one of the Spamhaus volunteers and Spamhaus is aware of the issue as well.
The right people are looking at the issue and Spamhaus and Yahoo are both working on fixing this.
Thanks for the reports and for the logs.

Read More

What is an email address? (part one)

Given we deal with email addresses every day, dozens or thousands or millions of them, it seems a bit strange to ask what an email address is – but given some of the problems people have with the grubbier corners of address syntax it’s actually an interesting question.
There are two real standards that define what is a valid email address and what isn’t. The most complex is RFC 5322 – Internet Message Format, which describes all sorts of things about the structure of an email, including what’s valid to put in From: and To: headers. It’s really too liberal in what it allows an email address to look like to be terribly useful, but it does provide for one very commonly used feature – the friendly from where the name that’s displayed to the recipient is not just the email address.

Read More

AOL checking DKIM

Sources tell me that AOL announced on yesterday’s ESPC call that they are now, and have been for about a week, checking DKIM inbound. This fits with a conversation I had with one of the AOL delivery team a month or so back where they were asking me about what senders would be most concerned about when / if AOL started using DKIM.
The other announcement is that AOL, like Yahoo, would like to know how you categorize your outgoing mail stream as part of the whitelisting process.
Both of these changes indicate to me that AOL will be improving the granularity of their filtering scheme. DKIM signing will let them separate out different domains and different reputations across a single sending IP address. The categorization will allow AOL to evaluate sender statistics within the context of the specific type of email. Transactional mail can have different statistics from newsletters from marketing mail. Better granularity means that poor senders will be less able to hide behind good senders. I expect to hear some wailing and gnashing of teeth about this change, but as time goes on senders will clean up their stats and their policies and, as a consequence will see their delivery improve everywhere, not just AOL.

Read More