As promised last week, here are some actual recommendations for handling email addresses.
First, some things to check when capturing an email address from a user, or when importing a list. These will exclude some legitimate email addresses, but not any that anyone is likely to actually be using. And they’ll allow in some email addresses that are technically not legal, by erring on the side of simple checks. But they’re an awful lot better than many of the existing email address filters.
- Check the email address immediately, so you can ask the user to correct it if it seems wrong
- If you reject the email address, leave it in the text entry box so the user can see what they typed, and correct it
- Trim leading and trailing spaces from the entered address, to clean up cut and paste errors
- Check that there’s just one “@” in the address
- Check that the domain part of the address, to the right of the “@”, consists only of ASCII letters, digits, “.”, “-” or “_” (underscores are technically not allowed, but do show up in badly implemented Windows mail systems occasionally, and delivery to them mostly works)
- Check that the domain part starts with a letter or a digit
- Check that the domain part ends with a letter
- Check that the domain part contains at least one “.” and doesn’t contain more than one “.” in a row
- Don’t try and check the right hand side of the domain part against a fixed list of valid top level domains (while you can get away with doing this in theory, if you regularly maintain the code and automatically compare it against the list of domains maintained by InterNIC and so on, in practice you’re not going to do that and next year there’ll be a new top level domain and it’ll all go horribly wrong)
- If possible do a DNS lookup on the domain part of the address for an MX record, and if that finds nothing, for an A record. Reject the address if both lookups come back as either NXDOMAIN (the domain doesn’t exist) or NOERROR with no records of that type. If the query returns a SERVFAIL or takes longer than a few seconds, give the address the benefit of the doubt and accept it.
- Either… check that the local part of the address doesn’t contain whitespace, commas, double or single quotes, backticks, parentheses or angle brackets (while these are technically allowed if quoted correctly they’re not characters that appear in normal email addresses, and they risk breaking your data handling at some point in the path)
- Or… check that the local part of the email address consists solely of letters, digits or any of + - = . _ (plus, hyphen, equals, dot and underscore). This is stricter than the previous suggestion, but should still allow any normal email address
- If the local part starts with “mailto:”, silently remove that. It’s fairly common when copying an email address to the clipboard from a context menu that it’ll be copied as a URL, with a leading mailto:, rather than as an email address
- Don’t put the email address through one of the standard “bad words” filters. Bob at Pen Island dot net may well want your mail, and he’s going to be upset if you tell him he has a potty mouth
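As a rough illustration, the syntactic checks above might be sketched in Python something like this. It's a sketch, not a definitive implementation: the function name `check_address` and the rejection messages are made up for the example, it uses the stricter "Or…" rule for the local part, it treats "letters" as ASCII letters, it matches "mailto:" case-insensitively (my assumption), and it leaves out the DNS lookup entirely, since that needs a resolver library and a network connection.

```python
import re

# Domain part: ASCII letters, digits, ".", "-" and the technically invalid "_".
_DOMAIN_CHARS = re.compile(r"[A-Za-z0-9._-]+")
# Local part, stricter option: ASCII letters, digits and + - = . _
_LOCAL_CHARS = re.compile(r"[A-Za-z0-9+\-=._]+")


def check_address(raw):
    """Return (cleaned_address, None) if accepted, else (None, reason).

    Hypothetical helper for illustration only; no DNS checks are done here.
    """
    # Trim leading and trailing spaces left over from cut and paste.
    address = raw.strip()

    # Silently drop a leading "mailto:" copied from a link.
    if address.lower().startswith("mailto:"):
        address = address[len("mailto:"):]

    # Exactly one "@".
    if address.count("@") != 1:
        return None, "expected exactly one @"
    local, domain = address.split("@")

    if not _LOCAL_CHARS.fullmatch(local):
        return None, "unexpected character in the part before the @"
    if not _DOMAIN_CHARS.fullmatch(domain):
        return None, "unexpected character in the domain"

    # The domain is known to be ASCII at this point, so isalnum()/isalpha()
    # mean "ASCII letter or digit" and "ASCII letter" respectively.
    if not domain[0].isalnum():
        return None, "domain must start with a letter or digit"
    if not domain[-1].isalpha():
        return None, "domain must end with a letter"
    if "." not in domain or ".." in domain:
        return None, "domain needs at least one dot and no doubled dots"

    # Deliberately no list of valid top level domains and no bad words filter.
    return address, None
```

On rejection, the caller would put the original text back in the entry box along with the reason, rather than clearing the field.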
Things to test, both in your address capture and in the rest of your infrastructure:
- Test your address capture with email addresses with leading and trailing spaces
- Test that your address capture stores email addresses with “+” or “=” in them correctly
- If you’re doing DNS testing, check that an email address where the domain part does not have an MX record but does have an A record is accepted
- If you’re doing DNS testing, make sure that you check some email addresses at domains that don’t belong to you
- Make sure that your VERP string generation, bounce handling, unsubscription URL generation and unsubscription handling all handle any of the characters you allow in the local part. “=” or “-” may be problems in VERP strings, while “+”, “&” and “;” are commonly problems in URLs.
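For the URL half of that last test, here's a minimal sketch of the kind of round trip worth checking, using Python's standard `urllib.parse`. The base URL, the `email` parameter name and both function names are placeholders I've invented for the example; your unsubscription links will look different. The point is only that the address is percent-encoded on the way out and decoded on the way back in, so "+", "&" and ";" survive.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def unsubscribe_url(address, base="https://example.com/unsubscribe"):
    """Build an unsubscription URL with the address safely percent-encoded."""
    # urlencode() escapes "+", "&", ";", "=" and "@" in the value.
    return base + "?" + urlencode({"email": address})


def address_from_url(url):
    """Recover the address from a URL built by unsubscribe_url()."""
    return parse_qs(urlsplit(url).query)["email"][0]


# Addresses with awkward characters should come back unchanged.
for addr in ["bob+news@example.com", "a=b@example.com", "x&y;z@example.com"]:
    assert address_from_url(unsubscribe_url(addr)) == addr
```

If the address is pasted into the URL without encoding, a "+" will typically come back as a space and an "&" will cut the address short, which is exactly the failure this test is meant to catch.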
Canonicalising email addresses:
- If possible don’t fold case on email addresses at all when you store them, so that you can return them to the user the way they were given to you. Bob@PenIsland.net is not the same as bob@penisland.net
- If you need to case fold an email address due to limitations in your data handling back end, fold it to lower case instead of upper. Many people have all lower case email addresses anyway, and they won’t notice the change, while all upper case reads like SHOUTING
- Don’t try and strip off something that looks like a tag in the local part – it may not be a tag, and even if it is the user wants it there and is probably using it
Comparing email addresses:
- Compare email addresses case-insensitively. While the local part of the email address is technically case-sensitive, most receivers treat them case-insensitively. And if you ever do run across two people whose email addresses differ only in case, it’s not accidental. Probably a deliverability blogger or an art project or something.
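One way to square "store it as given" with "compare case-insensitively" is to keep the original string and use a lower-cased copy only as the lookup key. A small sketch, assuming an in-memory dict stands in for whatever your data store really is; `same_address` and `add_subscriber` are hypothetical names:

```python
def same_address(a, b):
    """Compare two addresses case-insensitively, leaving tags alone."""
    return a.lower() == b.lower()


def add_subscriber(subscribers, address):
    """Store the address exactly as given, keyed by its lower-cased form.

    Returns False if an address differing only in case is already stored.
    """
    key = address.lower()
    if key in subscribers:
        return False
    subscribers[key] = address  # original capitalisation is preserved
    return True


subscribers = {}
add_subscriber(subscribers, "Bob@PenIsland.net")
add_subscriber(subscribers, "bob@penisland.net")  # treated as a duplicate
```

Note that nothing here strips anything that looks like a tag, so bob+news@example.com and bob@example.com remain two different addresses.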
There are a couple of things I’m not sure about, and where I’d definitely appreciate comments.
There are many places where any email address containing the four letters S, P, A and M is silently discarded. That’s something that’s given our friends at spamex.com, amongst other places, some serious problems. I don’t think it’s a sensible thing to do in general, not least because an email address that’s been specifically created for a mailing list signup, such as the email addresses at spamex, is a perfectly reasonable thing to be given as part of a mailing list signup. On the other hand, I know very smart people at large ESPs who consider it very sensible operational behaviour, at least, to discard any email address with “spam” in it. Any thoughts? Does it make any difference whether “spam” is in the local part or the domain part?
It’s pretty common to ask people to enter their email addresses twice, and require that they match before accepting the signup. Does this actually improve accuracy at all? Has anyone measured it? Or is it just a parallel to double entering a password change, or a misunderstanding of the meaning of “double opt-in” that’s spread from site to site via monkey-see, monkey-do?