What is an email address? (part three)

As promised last week, here are some actual recommendations for handling email addresses.
First some things to check when capturing an email address from a user, or when importing a list. These will exclude some legitimate email addresses, but not any that anyone is likely to actually be using. And they’ll allow in some email addresses that are technically not legal, by erring on the side of simple checks. But they’re an awful lot better than many of the existing email address filters.

  • Check the email address immediately, so you can ask the user to correct it if it seems wrong
  • If you reject the email address, leave it in the text entry box so the user can see what they typed, and correct it
  • Trim leading and trailing spaces from the entered address, to clean up cut and paste errors
  • Check that there’s just one “@” in the address
  • Check that the domain part of the address, to the right of the “@”, consists only of ascii characters – letters, digits, “.”, “-” or “_” (underscores are technically not allowed, but do show up in badly implemented windows mail systems occasionally, and delivery to them mostly works)
  • Check that the domain part starts with a letter or a digit
  • Check that the domain part ends with a letter
  • Check that the domain part contains at least one “.” and doesn’t contain more than one “.” in a row
  • Don’t try and check the right hand side of the domain part against a fixed list of valid top level domains (while you can get away with doing this in theory, if you regularly maintain the code and automatically compare it against the list of domains maintained by internic and so on, in practice you’re not going to do that and next year there’ll be a new top level domain and it’ll all go horribly wrong)
  • If possible do a DNS lookup on the domain part of the address for an MX record, and if that fails an A record – reject the address if you get an NXDOMAIN or NOERROR response to both. If the query returns a SERVFAIL or takes longer than a few seconds, give the address the benefit of the doubt and accept it.
  • Either… check that the local part of the address doesn’t contain whitespace, commas, double or single quotes, backticks, parentheses or angle brackets (while these are technically allowed if quoted correctly they’re not characters that appear in normal email addresses, and they risk breaking your data handling at some point in the path)
  • Or… check that the local part of the email address consists solely of letters, digits or any of + – = . _ . This is stricter than the previous suggestion, but should still allow any normal email address
  • If the local part starts with “mailto:”, silently remove that. It’s fairly common when copying an email address to the clipboard from a context menu that it’ll be copied as a URL, with a leading mailto:, rather than as an email address
  • Don’t put the email address through one of the standard “bad words” filters. Bob at Pen Island dot net may well want your mail, and he’s going to be upset if you tell him he has a potty mouth

Things to test, both in your address capture and in the rest of your infrastructure:

  • Test your address capture with email addresses with leading and trailing spaces
  • Test that your address capture stores email addresses with “+” or “=” in them correctly
  • If you’re doing DNS testing, check that an email address where the domain part does not have an MX record but does have an A record is accepted
  • If you’re doing DNS testing, make sure that you check some email addresses at domains that don’t belong to you
  • Make sure that your VERP string generation, bounce handling, unsubscription URL generation and unsubscription handling all handle any of the characters you allow in the local part. “=” or “-” may be problems in VERP strings, while “+”, “&” and “;” are commonly problems in URLs.

Canonicalising email addresses:

  • If possible don’t fold case on email addresses at all when you store them, so that you can return them to the user the way they were given to you. Bob@PenIsland.net is not the same as bob@penisland.net
  • If you need to case fold an email address due to limitations in your data handling back end, fold it to lower case instead of upper. Many people have all lower case email addresses anyway, and they won’t notice the change, while all upper case reads like SHOUTING
  • Don’t try and strip off something that looks like a tag in the local part – it may not be a tag, and even if it is the user wants it there and is probably using it

Comparing email addresses:

  • Compare email addresses case-insensitively. While the local part of the email address is technically case-sensitive, most receivers treat them case-insensitively. And if you ever do run across two people whose email addresses differ only in case, it’s not accidental. Probably a deliverability blogger or an art project or something.

There are a couple of things I’m not sure about, and where I’d definitely appreciate comments.
There are many places where any email address containing the four letters S, P, A and M is silently discarded. That’s something that’s given our friends at spamex.com, amongst other places, some serious problems. I don’t think it’s a sensible thing to do in general, not least because  an email address that’s been specifically created for a mailing lists signup, such as the email addresses at spamex, is a perfectly reasonable thing to be given as part of a mailing list signup. On the other hand, I know very smart people at large ESPs who consider it very sensible operational behaviour, at least, to discard any email address with “spam” in it. Any thoughts? Does it make any difference whether “spam” is in the local part or the domain part?
It’s pretty common to ask people to enter their email addresses twice, and require that they match before accepting the signup. Does this actually improve accuracy at all? Has anyone measured it? Or is it just a parallel to double entering a password change, or a misunderstanding of the meaning of “double opt-in” that’s spread from site to site via monkey-see, monkey-do?

Related Posts

What is an email address? (part two)

Yesterday I talked about the technical definitions of an email address. Eventually on Monday I’m going to talk about some useful day-to-day rules about email address acquisition and analysis, but first I’m going to take a detour into tagging or mailboxing email addresses.
Tagging an email address is something the owner of an email address can do to make it easier to handle incoming email. It works by adding an extra word to the local part of the email address separated by a special character, such as “+”, “=” or “-“. So, if my email address is steve@example.com, and I’m signing up for the MAAWG mailing lists I can sign up with the email address steve+maawg@example.com. When mail is sent to steve+maawg@example.com it will be delivered to my steve@example.com mailbox, but I’ll know that it’s mail from MAAWG. I can use that tag to whitelist that mail, to filter it to it’s own mailbox and a bunch of other useful things.
In some ways this is similar to recent disposable email address services, but rather than being a third party service it’s something that’s been built in to many mailservers for well over a decade. It doesn’t require me to create each new address at a web page, instead I can make tags up on the fly. And it works at my regular mail domain.
If you’re an ESP it can be interesting to look for tagged addresses in uploaded lists. If it’s a list owned by Kraft and you see the email address steve+gevalia@example.com in the list, that’s a strong sign that that email address at least was really volunteered to the list owner. If you see the email address steve+microsoft@example.com then it’s a strong sign that it wasn’t, and you might want to look harder at where the list came from.
One reason that this is relevant to email address capture is that tagged addresses are something that you should expect people, especially more sophisticated users of email, to use to sign up to mailing lists and that they’re something you don’t want to discourage. Yet many web signup forms forbid entering email addresses with a “+” or, worse, have bugs in them that map a “+” sign in the email address to a space – leading to the signup failing at best, or the wrong email address being added to the list at worst. This really annoys people who use tagged addresses to help manage their email, and they’re often exactly the sort of tech-savvy people who make a lot of online purchases you want to have on your lists.
More on Monday.

Read More

What is an email address? (part one)

Given we deal with email addresses every day, dozens or thousands or millions of them, it seems a bit strange to ask what an email address is – but given some of the problems people have with the grubbier corners of address syntax it’s actually an interesting question.
There are two real standards that define what is a valid email address and what isn’t. The most complex is RFC 5322 – Internet Message Format, which describes all sorts of things about the structure of an email, including what’s valid to put in From: and To: headers. It’s really too liberal in what it allows an email address to look like to be terribly useful, but it does provide for one very commonly used feature – the friendly from where the name that’s displayed to the recipient is not just the email address.

Read More