Which is better UTF-8 or ISO-?

Someone asked today on a mailing list whether they should be using UTF-8 or “ISO” encoding for sending email. What’s the best choice depends on some of the details of the situation, but here’s the answer I gave:
UTF-8 will work for pretty much anything, as it’s just an 8 bit encoding scheme for Unicode (which is supposed to be the one character encoding to rule them all). It’s well supported in most languages and development environments – Windows has been native UTF-16 under the covers since the mid 90s, for instance – and typical messages that use mainstream glyphs should render well from utf-8 in most western MUAs and browsers.
There are still a very few old or broken clients out there that will not handle UTF-8 well but (outside the asian language market, where there’s still some non-ASCII, non-Unicode legacy usage) they’re typically ones that don’t really handle any character set encoding well and the only thing safe to send to them is either plain ASCII or whichever ASCII superset their OS happens to support natively (which is probably an argument for sending Windows-1252 codepage, but not a terribly strong one).
The various extended ASCIIs (such as ISO-8859-*) will only work for messages that are written solely using characters from that character set. If you have even one character in a message that cannot be expressed in ISO-8859-1, then you can’t use ISO-8859-1 to send that message.
ISO-8859-1 (aka Latin1) is fairly sloppy in some respects – it has no apostrophe, nor single quotes, for instance – but it can handle an awful lot of languages, from Kurdish to Swahili. It can’t handle Dutch, Estonian, Finnish, Hungarian and Welsh particularly well, nor can it show the Euro symbol (ISO-8859-14 or -15 are needed for some characters there).
A common problem is that many people (and the software they write) think that Windows uses Latin1. It doesn’t, it uses Windows-1252. If you accept messages written on Windows, using the Windows-1252 code page, and throw them out on the wire as ISO-8859-1 what you end up with is not quite right. It mostly works, as the two codepages overlap quite a bit, but they have different glyphs in the 0x80-0x9f range. So if you use single or double quotes (“smart quotes”), or the Euro symbol, or ellipses, or bullet, or the trademark symbol in your message they’ll be garbled. This is so common that some mail clients and web browsers will actually treat a document that claims to be ISO-8859-1 as Windows-1252, but that’s a bug workaround and not something it’s really safe to rely on.
If you’re doing personalized messages, and you’re sending one of them to Győző and one of them to Eiður then you may have to use different character sets for the two messages. If you’re talking about Győző and personalizing it for Eiður then you might find things break horribly.
Someone probably has some concrete data on mail client character set support, broken down by region and language, but my understanding is that this is a reasonable approach:

  1. If you’re sending any entirely non-ASCII messages (asian languages, primarily) then it’s way more complex than it should be, and you need to find someone who really understands the deployment of BIG5 and ShiftJIS and so on if you want to dial in to perfection. If you don’t have someone like that, UTF-8 is your best bet.
  2. If your tool chain supports non-ASCII messages, and you want to choose a single encoding, go with UTF-8. If the UTF-8 you end up sending is entirely, or almost entirely, ASCII then this will render well even on the tiny fraction of mail clients that don’t support character sets. (This is what I’d do).
  3. If your tool chain supports non-ASCII, but almost all your messages are plain ASCII and you want to be clever, encode each message separately. That’ll mean storing the templates and mailmerge data as UTF-8 probably, then converting the resulting message to ASCII if possible. That’s actually a lot simpler to do than it sounds, as a valid UTF-8 message that can be converted to ASCII is already an ASCII message, so all you need to do is set the charset to UTF-8 if there’s any byte bigger than 127 in the message or to us-ascii otherwise.
  4. If your toolchain was written to support only truly 8 bit character sets, or if that’s all you ever expect to send, then ISO-8859-15 is about as good as you can do.

If you’re accepting messages via a web front end, or any other path that could lead to stuff being copy and pasted or uploaded from Windows, look out for the Windows-1252 issues, and convert the provided message to whatever format you’re storing templates as (probably utf-8).
Then, once you have the message, be careful to send it over the wire as 7 bit ASCII, probably using quoted-printable encoding, to make sure it isn’t thrown away or transcoded to base64 en-route.

Related Posts

What is an email address? (part two)

Yesterday I talked about the technical definitions of an email address. Eventually on Monday I’m going to talk about some useful day-to-day rules about email address acquisition and analysis, but first I’m going to take a detour into tagging or mailboxing email addresses.
Tagging an email address is something the owner of an email address can do to make it easier to handle incoming email. It works by adding an extra word to the local part of the email address separated by a special character, such as “+”, “=” or “-“. So, if my email address is steve@example.com, and I’m signing up for the MAAWG mailing lists I can sign up with the email address steve+maawg@example.com. When mail is sent to steve+maawg@example.com it will be delivered to my steve@example.com mailbox, but I’ll know that it’s mail from MAAWG. I can use that tag to whitelist that mail, to filter it to it’s own mailbox and a bunch of other useful things.
In some ways this is similar to recent disposable email address services, but rather than being a third party service it’s something that’s been built in to many mailservers for well over a decade. It doesn’t require me to create each new address at a web page, instead I can make tags up on the fly. And it works at my regular mail domain.
If you’re an ESP it can be interesting to look for tagged addresses in uploaded lists. If it’s a list owned by Kraft and you see the email address steve+gevalia@example.com in the list, that’s a strong sign that that email address at least was really volunteered to the list owner. If you see the email address steve+microsoft@example.com then it’s a strong sign that it wasn’t, and you might want to look harder at where the list came from.
One reason that this is relevant to email address capture is that tagged addresses are something that you should expect people, especially more sophisticated users of email, to use to sign up to mailing lists and that they’re something you don’t want to discourage. Yet many web signup forms forbid entering email addresses with a “+” or, worse, have bugs in them that map a “+” sign in the email address to a space – leading to the signup failing at best, or the wrong email address being added to the list at worst. This really annoys people who use tagged addresses to help manage their email, and they’re often exactly the sort of tech-savvy people who make a lot of online purchases you want to have on your lists.
More on Monday.

Read More

AOL and DKIM

Yesterday, on an ESPC call, Mike Adkins of AOL announced upcoming changes to the AOL reputation system. As part of these changes, AOL will be checking DKIM on the inbound. Best estimates are that this will be deployed in the first half of 2009, possibly in Q1. This is something AOL has been hinting at for most of 2008.
As part of this, AOL has deployed an address where any sender can check the validity of a DKIM signature against the AOL DKIM implementation. To check a signature, send an email to any address at dkimtest.aol.com.
I have done a couple of tests, from a domain not signing with either DK or DKIM, from a domain signing with DK and from a domain signing with both DK and DKIM. In all cases, the mail is rejected by AOL. The specific rejection messages are different, however.
Unsighng domain: host dkimtest-d01.mx.aol.com[205.188.103.106] said: 554-ERROR: No DKIM header found 554 TRANSACTION FAILED (in reply to
end of DATA command)
DK signing domain: “205.188.103.106 failed after I sent the message.
Remote host said: 554-ERROR: No DKIM header found
554 TRANSACTION FAILED”
DK/DKIM signing domain: “We tried to delivery your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other server returned was: 554 554-PASS: DKIM authentication verified
554 TRANSACTION FAILED (state 18).”
As you can see, in all cases mail is rejected from that address. However, when there is a valid DKIM signature, the failure message is “554-PASS.”
As I have been recommending for months now, all senders should be planning to sign with DKIM early in 2009. AOL’s announcement that they will be using DKIM signatures as part of their reputation scoring system is just one more reason to do so.

Read More

Troubleshooting the simple stuff

I was talking with one of my Barry pals recently and was treated to a rant regarding deliverability experts that can’t manage simple things. We’ve been having an ongoing conversation recently about the utterly stupid and annoying questions some senders ask. Last week, I was ranting about a delivery person asking what “5.7.1. Too many receipts this session” meant. This morning I got an IM.

Read More