Internationalisation (part 2)
In part 1 I talked about internationalised domain names, and how they were mapped onto ASCII strings.
For sending email there are four bits of the message where internationalisation might need to be considered.
- Sender or recipient email address
- Header content, such as the Subject line or the “friendly” name in the To or From
- The visible body of the message
- The web URLs the body of the message links to
Some of these have been standardised for many years, and should already be supported fine by your current software, but others are new or don’t even exist yet, and might cause some new problems.
1. Email Addresses
Right now there’s only (very) experimental support for internationalised email addresses, so it’s not something we need to think about in too much detail yet (RFCs 4952, 5335 and 5336 are a good place to start if you crave detail).
They are coming sooner or later, though, and you might see people trying to use them, at least with non-ascii domain parts, sooner than that. See the ICANN IDN wiki for some example autoresponders.
It’s probably a good idea to start checking whether the bits of your infrastructure that capture, store and report on email addresses handle UTF-8 non-ascii addresses, and to make sure that any new apps you develop support it.
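As a rough illustration of the kind of check worth running, here is a minimal Python sketch that converts the domain part of an address to its ASCII form and leaves the local part alone. The function name `normalise_address` is made up for this example, and Python’s built-in `idna` codec implements the older IDNA 2003 rules, so treat it as a starting point rather than a complete implementation.

```python
def normalise_address(address):
    """Return the address with its domain part in ASCII (punycode) form.

    Hypothetical helper for illustration: the local part is left untouched,
    because only the domain part has a standard ASCII mapping.
    """
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not an email address: %r" % (address,))
    # The built-in "idna" codec encodes each label of the hostname separately.
    ascii_domain = domain.encode("idna").decode("ascii")
    return local + "@" + ascii_domain


print(normalise_address("user@例子.測試"))
```

If a database column, a CSV export or a reporting page can store and show both the original and the converted form without mangling either, that part of the pipeline is probably in reasonable shape.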
2. Header Content
Non-ascii header content is already well supported by pretty much all of the email infrastructure. Your email composition app most likely supports non-ascii subject lines already. You might want to check whether your address import and mail generation software can handle non-ascii friendly names in the To header, though.
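If you want to see what that looks like on the wire, this small sketch uses Python’s standard `email` package to encode a non-ascii subject and a non-ascii friendly name, then decodes them again. The subject, name and address are invented sample data, and `decode_words` is just a helper for this example.

```python
from email.header import Header, decode_header, make_header
from email.utils import formataddr, parseaddr

# Sample data, invented for this example.
subject = "Grüße aus Köln"
friendly_name = "José Núñez"

# Encode the subject as RFC 2047 "encoded words" so the header is pure ASCII.
encoded_subject = Header(subject, "utf-8").encode()

# formataddr() encodes a non-ascii friendly name the same way.
to_header = formataddr((friendly_name, "jose@example.com"))


def decode_words(value):
    """Turn RFC 2047 encoded words back into a readable string."""
    return str(make_header(decode_header(value)))


name, address = parseaddr(to_header)
print(encoded_subject)
print(to_header)
print(decode_words(encoded_subject), "|", decode_words(name), "|", address)
```

A quick test for an import tool is to feed it a name like the one above and confirm that what comes out the other end decodes back to the same text.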
3. Message Body
Again, this is pretty well supported and even if you don’t think you’re using it, you probably are if you’re pasting Word docs into your email.
4. Embedded URLs
Here is where things get interesting.
Internationalised URLs in plain text email
The best you can do in plain text email is to embed the internationalised URL in the plain text part of the message using the right encoding – like this http://例子.測試/ – and hope that the URL is clickable. Many MUAs don’t recognise that as a clickable link, and that’s unlikely to change. You may not care, assuming that the vast majority of your recipients are reading an HTML version of your message, but if you have any compliance-required links, such as an unsubscribe link, it may be safer to use an alternate URL that’s purely ascii.
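A minimal sketch of that approach, using Python’s standard `email.message.EmailMessage`, might look like the following. The addresses and the ascii unsubscribe URL are placeholders; the point is only that the text part is declared as UTF-8 and the compliance link stays ascii.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Our new site"
msg["From"] = "sender@example.com"      # placeholder address
msg["To"] = "recipient@example.com"     # placeholder address

# set_content() labels the text part as UTF-8 by default, so the
# internationalised URL survives; the unsubscribe link is kept pure ascii.
msg.set_content(
    "Visit us at http://例子.測試/\n"
    "Unsubscribe: http://example.com/unsubscribe\n"
)

print(msg.get_content_type(), msg.get_content_charset())
```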
Internationalised URLs in HTML email
There’s quite a lot of concern in the security community about the use of internationalised domain names to create “lookalike” URLs for phishing. For instance, pаypal.com contains a Cyrillic “а” rather than an ascii “a”, so a phisher could register that domain and convince people it was the real PayPal. Wikipedia has more details on the attack, but for our purposes we mostly need to know that spam and phishing filters are likely to be increasingly paranoid about URL usage.
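To make the lookalike problem concrete, here is a short Python sketch that lists the non-ascii characters hiding in a hostname. The helper name `describe_non_ascii` is invented for this example, and real filters use far more elaborate checks than this.

```python
import unicodedata


def describe_non_ascii(hostname):
    """List each non-ascii character in a hostname with its Unicode name."""
    return [
        (ch, unicodedata.name(ch, "UNNAMED"))
        for ch in hostname
        if ord(ch) > 127
    ]


real = "paypal.com"
lookalike = "p\u0430ypal.com"  # second letter is U+0430, not an ascii "a"

print(real == lookalike)              # the strings differ despite looking alike
print(describe_non_ascii(lookalike))  # shows which character is the odd one out
print(lookalike.encode("idna"))       # the ASCII form that is actually looked up
```

The two hostnames render almost identically in most fonts, but they are different strings and resolve to different domains.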
There are four main ways you might put an internationalised URL in a message:
- <a href="http://例子.測試/">Click here</a>
- <a href="http://xn--fsqu00a.xn--g6w251d/">Click here</a>
- <a href="http://例子.測試/">http://例子.測試/</a> (or <a href="http://例子.測試/">例子.測試</a>)
- <a href="http://xn--fsqu00a.xn--g6w251d/">http://例子.測試/</a> (or <a href="http://xn--fsqu00a.xn--g6w251d/">例子.測試</a>)
For ascii links, best practice is already to use something like the first two examples. That’s because many phishing filters are very suspicious of links where the visible text looks like a URL or a hostname but differs from where the link actually points – and even if the two are identical when composed, they’ll often be rewritten to differ by the click tracking tools used by the ESP sending the mail.
For internationalised URLs that advice is likely to be even better practice, as it’s unclear how any given filter is going to treat the encoding issues and significantly more likely that they’ll consider a link that has a hostname in the visible text to be a sign of phishing. So encourage your users to avoid the third and fourth types.
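One way to catch the risky formats before a campaign goes out is to scan the HTML for links whose visible text looks like a URL or hostname. The sketch below does that with Python’s standard `html.parser`; the class name `LinkTextAuditor` and the `LOOKS_LIKE_HOST` pattern are assumptions made for this example, and the pattern is a crude heuristic, not what any particular filter uses.

```python
import re
from html.parser import HTMLParser

# Crude heuristic: no whitespace, at least one dot, optional http(s):// prefix.
LOOKS_LIKE_HOST = re.compile(r"^(https?://)?[^\s/]+\.[^\s/]+(/\S*)?$")


class LinkTextAuditor(HTMLParser):
    """Collect (href, visible text) pairs where the text looks like a URL."""

    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>
        self.flagged = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if LOOKS_LIKE_HOST.match(text):
                self.flagged.append((self._href, text))
            self._href = None


auditor = LinkTextAuditor()
auditor.feed('<a href="http://例子.測試/">Click here</a>')
auditor.feed('<a href="http://例子.測試/">例子.測試</a>')
print(auditor.flagged)  # only the second link is reported
```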
Which of the first two options is better isn’t clear yet. Format 1 will (probably) give a slightly better user experience as it’s more likely to show the friendly hanzi link rather than the unfriendly encoded link to the user (while hovering over it, for example) but it’s possible that some mail clients may break the first format and work with the second format. I’d go with the first format unless testing with real world mail clients shows problems.
As with the plain text URLs, until there’s enough operational experience to show that internationalised URLs work perfectly in all mail clients, it’s probably a good idea to use plain ascii URLs for unsubscribe links and any other compliance-required links.
If you’re tracking clicks by routing all the web links in the email through your own webserver, which then redirects to the customer’s real website (and almost everyone does that), then you’ll also need to make sure that your click tracking system can handle internationalised URLs.
That means it’ll have to accept URLs from the user that look like “http://例子.測試/首頁” and display them in that format in reports and statistics, but redirect at the HTTP level using a header like “Location: http://xn--fsqu00a.xn--g6w251d/%E9%A6%96%E9%A0%81”, where the original UTF-8 URL is encoded in two parts – the hostname in punycode and the rest of the URL in percent encoding.
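The two-part encoding can be sketched in a few lines of Python using only the standard library. The function name `to_ascii_url` is made up for this example, it ignores any user:password part of the URL, and the built-in `idna` codec follows the older IDNA 2003 rules, so this is an illustration rather than production code.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Convert a UTF-8 URL to the ASCII form used in a Location header.

    Hypothetical helper: hostname goes to punycode, everything after it is
    percent-encoded. Any userinfo in the URL is dropped.
    """
    parts = urlsplit(url)
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)
    # "%" is marked safe so existing percent escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(to_ascii_url("http://例子.測試/首頁"))
# http://xn--fsqu00a.xn--g6w251d/%E9%A6%96%E9%A0%81
```

Keeping the original UTF-8 string alongside the converted one means reports can show the readable form while the redirect uses the ASCII form.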
It will also have to return a valid Content-Type header (typically “Content-Type: text/html; charset=utf-8”) which is something you can get away without doing for simple ASCII redirection, so it may not be handled by your existing software.
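For illustration, a bare-bones redirect handler written as a Python WSGI application could look like this. The name `redirect_app` and the hard-coded target are assumptions for the example; a real click tracker would look the target up from the tracking link, and the target is assumed to be in its ASCII form already.

```python
def redirect_app(environ, start_response):
    """Minimal WSGI app that redirects to a fixed, already-ASCII target."""
    # Assumed to have been converted to punycode + percent encoding earlier.
    target = "http://xn--fsqu00a.xn--g6w251d/%E9%A6%96%E9%A0%81"
    body = '<a href="{0}">{0}</a>'.format(target).encode("utf-8")
    headers = [
        ("Location", target),
        # Explicit charset, as described above.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("302 Found", headers)
    return [body]
```

It can be tried locally with the standard library’s `wsgiref.simple_server.make_server("", 8000, redirect_app).serve_forever()`.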
More demand for internationalised domain names is coming. Start testing your workflow now, so that you can tell customers “Yes, we support that.” when they start asking for them.