Internationalisation (part 1)

There’s been a gentle bit of uproar recently about ICANN finally beginning the process of rolling out support for internationalised domain names (IDNs) at the DNS root, and the effect that may have on email senders. Even if you haven’t noticed the uproar, it’s still a subject you probably want to be familiar with if you’re sending email.
What are internationalised domain names?
An internationalised domain name is simply a domain name that uses non-ASCII characters – most anything other than a-z, 0-9 and ‘-’ – such as those used in these URLs: http://пример.испытание/ or http://例子.測試/ (if those links are unreadable or don’t work, it means that your browser isn’t handling IDN well or doesn’t have the appropriate fonts installed yet).
They’re an obvious thing to want, especially if you’re from anywhere other than an anglophone country, but the Internet was originally built as an ASCII-only network, and under the covers it still is entirely ASCII-only, so layering non-ASCII characters on top has taken a lot of work and time to roll out. IDN development dates back to at least 1996, and it has been supported by some top-level domains since 2003. So the recent announcement to support non-ASCII top-level domains is just the latest step in a long and careful process.
Almost all of the underlying internet protocols are still ASCII based though, including DNS and SMTP, so a lot of the internationalisation work involves mapping non-ASCII words onto ASCII strings before they’re passed to the network, and mapping them back again before they’re displayed to the user. This is done in a fairly ad-hoc way, different in different protocols.
If you were to visit the Cyrillic URL I mentioned above, the first thing your web browser would do is take the Cyrillic string “пример.испытание” and translate it to the ASCII hostname “xn--e1afmkfd.xn--80akhbyknj4f”, then look that up in the DNS to find the server handling that URL.
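As a rough sketch of that translation step, here is the same conversion done with the "idna" codec from Python's standard library. Python is my choice for illustration, not something a browser uses, and that codec implements the older IDNA 2003 rules, so treat it as a demonstration rather than a reference implementation.

```python
# Hostname taken from the Cyrillic example URL.
hostname = "пример.испытание"

# The built-in "idna" codec converts each label to its ASCII ("xn--") form.
ascii_hostname = hostname.encode("idna").decode("ascii")
print(ascii_hostname)  # xn--e1afmkfd.xn--80akhbyknj4f

# Decoding with the same codec maps the ASCII form back for display.
display_hostname = ascii_hostname.encode("ascii").decode("idna")
print(display_hostname)  # пример.испытание
```

The ASCII form is what gets looked up in the DNS; the Unicode form is only for showing to people.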
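If you were to display that on a webpage or in an HTML email, it might be converted to ASCII as “http://&#1087;&#1088;&#1080;&#1084;&#1077;&#1088;.&#1080;&#1089;&#1087;&#1099;&#1090;&#1072;&#1085;&#1080;&#1077;/”, using HTML numeric character references.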
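A minimal sketch of the HTML case, assuming the page is forced down to plain ASCII: Python's "xmlcharrefreplace" error handler swaps each non-ASCII character for a decimal numeric character reference. Real HTML generators may use hex references or simply serve the page as UTF-8 instead.

```python
import html

url = "http://пример.испытание/"

# Replace every character that doesn't fit in ASCII with &#NNNN;
escaped = url.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)

# A browser (or html.unescape) turns the references back into the original text.
restored = html.unescape(escaped)
print(restored == url)  # True
```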
If you were to send it as part of a plain text email, encoded as UTF8/quoted-printable, it would look like “http://=D0=BF=D1=80=D0=B8=D0=BC=D0=B5=D1=80.=D0=B8=D1=81=D0=BF=D1=8B=D1=82=D0=B0=D0=BD=D0=B8=D0=B5/”. If there are a lot of non-ASCII characters in the message then it’s more likely to be encoded as UTF8/base64: “aHR0cDovL9C/0YDQuNC80LXRgC7QuNGB0L/Ri9GC0LDQvdC40LUvCg==”.
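To make the two email encodings concrete, here is a sketch using Python's standard quopri and base64 modules. Quoted-printable writes each non-ASCII byte as “=” followed by two hex digits, and a real encoder will also insert soft line breaks (“=” at the end of a line) once a line passes 76 characters, which this URL does. The base64 string in the text includes a trailing newline, so the sketch adds one too.

```python
import base64
import quopri

url = "http://пример.испытание/"
raw = url.encode("utf-8")

# Quoted-printable: ASCII stays readable, other bytes become =XX.
qp = quopri.encodestring(raw)
print(qp.decode("ascii"))

# Base64: the whole byte string (plus a trailing newline) becomes opaque ASCII.
b64 = base64.b64encode(raw + b"\n").decode("ascii")
print(b64)  # aHR0cDovL9C/0YDQuNC80LXRgC7QuNGB0L/Ri9GC0LDQvdC40LUvCg==

# Both decode back to the same UTF-8 bytes.
print(quopri.decodestring(qp) == raw)
print(base64.b64decode(b64) == raw + b"\n")
```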
And all of those will (or at least should) be displayed to the end user identically.
Confused yet? That’s fine. Internationalisation on the Internet is a very complex and inconsistent subject. In my next post I’ll try to narrow down which bits of it you need to worry about when it comes to sending email and to not upsetting phishing or spam filters at the recipient’s ISP.


By steve
