What is an email address? (part three)


As promised last week, here are some actual recommendations for handling email addresses.
First, some things to check when capturing an email address from a user, or when importing a list. These checks will exclude some legitimate email addresses, but not any that anyone is likely to actually be using. And, because they err on the side of simplicity, they’ll allow in some email addresses that are technically not legal. But they’re an awful lot better than many of the existing email address filters.

  • Check the email address immediately, so you can ask the user to correct it if it seems wrong
  • If you reject the email address, leave it in the text entry box so the user can see what they typed, and correct it
  • Trim leading and trailing spaces from the entered address, to clean up cut and paste errors
  • Check that there’s just one “@” in the address
  • Check that the domain part of the address, to the right of the “@”, consists only of ASCII characters – letters, digits, “.”, “-” or “_” (underscores are technically not allowed, but do show up occasionally in badly implemented Windows mail systems, and delivery to them mostly works)
  • Check that the domain part starts with a letter or a digit
  • Check that the domain part ends with a letter
  • Check that the domain part contains at least one “.” and doesn’t contain more than one “.” in a row
  • Don’t try and check the right hand side of the domain part against a fixed list of valid top level domains (while you can get away with doing this in theory, if you regularly maintain the code and automatically compare it against the list of domains maintained by InterNIC and so on, in practice you’re not going to do that and next year there’ll be a new top level domain and it’ll all go horribly wrong)
  • If possible do a DNS lookup on the domain part of the address for an MX record, and if that finds nothing, for an A record – reject the address if both queries come back with either NXDOMAIN or NOERROR with no records in the answer. If a query returns SERVFAIL or takes longer than a few seconds, give the address the benefit of the doubt and accept it.
  • Either… check that the local part of the address doesn’t contain whitespace, commas, double or single quotes, backticks, parentheses or angle brackets (while these are technically allowed if quoted correctly they’re not characters that appear in normal email addresses, and they risk breaking your data handling at some point in the path)
  • Or… check that the local part of the email address consists solely of letters, digits or any of “+”, “-”, “=”, “.” or “_”. This is stricter than the previous suggestion, but should still allow any normal email address
  • If the local part starts with “mailto:”, silently remove that. It’s fairly common when copying an email address to the clipboard from a context menu that it’ll be copied as a URL, with a leading mailto:, rather than as an email address
  • Don’t put the email address through one of the standard “bad words” filters. Bob at Pen Island dot net may well want your mail, and he’s going to be upset if you tell him he has a potty mouth
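
As a rough sketch of how those checks might fit together, here’s some Python using the stricter of the two local part options. The names check_syntax, Dns and domain_may_take_mail are made up for illustration, and the lookup argument stands in for whatever DNS resolver you actually have (Python’s standard library has no MX lookup), so treat this as a starting point under those assumptions rather than a finished filter.

```python
import re
from enum import Enum
from typing import Callable, Optional, Tuple

# Stricter local-part option: letters, digits and + - = . _
LOCAL_RE = re.compile(r"[A-Za-z0-9+\-=._]+")
# Domain part: ASCII letters, digits, ".", "-" and (leniently) "_"
DOMAIN_RE = re.compile(r"[A-Za-z0-9._-]+")


def check_syntax(raw: str) -> Tuple[Optional[str], str]:
    """Return (cleaned address, "") if acceptable, else (None, reason)."""
    addr = raw.strip()                      # cut-and-paste whitespace
    if addr.lower().startswith("mailto:"):  # address copied as a URL
        addr = addr[len("mailto:"):].strip()
    if addr.count("@") != 1:
        return None, "need exactly one @"
    local, domain = addr.split("@")
    if not LOCAL_RE.fullmatch(local):
        return None, "unusual character in local part"
    if not DOMAIN_RE.fullmatch(domain):
        return None, "unusual character in domain part"
    if not domain[0].isalnum():
        return None, "domain must start with a letter or digit"
    if not domain[-1].isalpha():
        return None, "domain must end with a letter"
    if "." not in domain or ".." in domain:
        return None, "domain needs a dot, and never two in a row"
    return addr, ""


class Dns(Enum):
    """Hypothetical summary of the outcome of one DNS query."""
    ANSWER = 1    # NOERROR with at least one record
    NODATA = 2    # NOERROR with no records of that type
    NXDOMAIN = 3
    SERVFAIL = 4
    TIMEOUT = 5   # no reply within a few seconds


def domain_may_take_mail(domain: str,
                         lookup: Callable[[str, str], Dns]) -> bool:
    """Try MX, then A; reject only on a definite "no" to both."""
    for record_type in ("MX", "A"):
        result = lookup(domain, record_type)
        if result is Dns.ANSWER:
            return True
        if result in (Dns.SERVFAIL, Dns.TIMEOUT):
            return True   # benefit of the doubt
    return False
```

Returning a reason alongside the failure makes it easy to show the user what looked wrong while leaving what they typed in the text box. Keeping the DNS decision separate from the resolver also means it can be tested without touching the network.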

Things to test, both in your address capture and in the rest of your infrastructure:

  • Test your address capture with email addresses with leading and trailing spaces
  • Test that your address capture stores email addresses with “+” or “=” in them correctly
  • If you’re doing DNS testing, check that an email address where the domain part does not have an MX record but does have an A record is accepted
  • If you’re doing DNS testing, make sure that you check some email addresses at domains that don’t belong to you
  • Make sure that your VERP string generation, bounce handling, unsubscription URL generation and unsubscription handling all handle any of the characters you allow in the local part. “=” or “-” may be problems in VERP strings, while “+”, “&” and “;” are commonly problems in URLs.
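
The URL half of that last test is easy to illustrate. This is a minimal sketch, assuming the address travels in a query string parameter; the base URL, the parameter names addr and list, and both function names are hypothetical. It uses Python’s standard urllib.parse to percent-encode the address on the way out and decode it on the way back in.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def unsubscribe_url(base: str, address: str, list_id: str) -> str:
    """Build an unsubscription URL with the address safely encoded."""
    # urlencode percent-encodes "+", "&", ";", "=" and "@" inside values,
    # so they can't be mistaken for query string syntax.
    return base + "?" + urlencode({"addr": address, "list": list_id})


def address_from_url(url: str) -> str:
    """Recover the address from a URL built by unsubscribe_url."""
    return parse_qs(urlsplit(url).query)["addr"][0]


url = unsubscribe_url("https://lists.example/unsubscribe",
                      "bob+tag@example.com", "news")
print(url)
# https://lists.example/unsubscribe?addr=bob%2Btag%40example.com&list=news
print(address_from_url(url))
# bob+tag@example.com
```

The failure this guards against is pasting the raw address into the URL: a literal “+” in a query string decodes as a space, so bob+tag@example.com comes back as “bob tag@example.com” and the unsubscribe silently misses.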

Canonicalising email addresses:

  • If possible don’t fold case on email addresses at all when you store them, so that you can return them to the user the way they were given to you. Bob@PenIsland.net is not the same as bob@penisland.net
  • If you need to case fold an email address due to limitations in your data handling back end, fold it to lower case instead of upper. Many people have all lower case email addresses anyway, and they won’t notice the change, while all upper case reads like SHOUTING
  • Don’t try and strip off something that looks like a tag in the local part – it may not be a tag, and even if it is the user wants it there and is probably using it

Comparing email addresses:

  • Compare email addresses case-insensitively. While the local part of the email address is technically case-sensitive, most receivers treat them case-insensitively. And if you ever do run across two people whose email addresses differ only in case, it’s not accidental. Probably a deliverability blogger or an art project or something.
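
One way to get both behaviours, keeping the address as the user typed it while comparing without regard to case, is to store the original and look it up by a lowercased key. This is only a sketch: SubscriberList and same_address are names I’ve invented for the example, and it assumes addresses have already been through the capture checks above.

```python
from typing import Dict, Optional


def same_address(a: str, b: str) -> bool:
    """Compare two addresses, ignoring case in both parts."""
    return a.lower() == b.lower()


class SubscriberList:
    """Stores each address as given, keyed by its lowercased form."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}

    def add(self, address: str) -> bool:
        """Add an address; return False if it is already present."""
        key = address.lower()
        if key in self._by_key:
            return False
        self._by_key[key] = address   # keep the user's own capitalisation
        return True

    def stored_form(self, address: str) -> Optional[str]:
        """Return the address as originally entered, or None if unknown."""
        return self._by_key.get(address.lower())


subs = SubscriberList()
subs.add("Bob@PenIsland.net")
print(subs.add("bob@penisland.net"))          # False: already subscribed
print(subs.stored_form("BOB@PENISLAND.NET"))  # Bob@PenIsland.net
```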

There are a couple of things I’m not sure about, and where I’d definitely appreciate comments.
There are many places where any email address containing the four letters S, P, A and M is silently discarded. That’s something that’s given our friends at spamex.com, amongst other places, some serious problems. I don’t think it’s a sensible thing to do in general, not least because an email address that’s been specifically created for a mailing list signup, such as the email addresses at spamex, is a perfectly reasonable thing to be given as part of a mailing list signup. On the other hand, I know very smart people at large ESPs who consider it very sensible operational behaviour, at least, to discard any email address with “spam” in it. Any thoughts? Does it make any difference whether “spam” is in the local part or the domain part?
It’s pretty common to ask people to enter their email addresses twice, and require that they match before accepting the signup. Does this actually improve accuracy at all? Has anyone measured it? Or is it just a parallel to double entering a password change, or a misunderstanding of the meaning of “double opt-in” that’s spread from site to site via monkey-see, monkey-do?

3 comments

  • A number of the rules you listed will reject valid email addresses. Section 3 of RFC 3696 (http://tools.ietf.org/html/rfc3696) has a list of “weird” but valid email addresses:
    * Abc\@def@example.com
    * Fred\ Bloggs@example.com
    * Joe.\\Blow@example.com
    * "Fred Bloggs"@example.com
    * "Abc@def"@example.com
    * customer/department=shipping@example.com
    * $A12345@example.com
    * !def!xyz%abc@example.com
    * _somename@example.com
    BTW, your rules discard most invalid email addresses; maybe they could be used to alert the user about a possible typo, because a full RFC 2821 parser in JavaScript would be overkill.

  • They’ll also reject
    “bob@tv”
    and
    “eli (Pogonatus (latin for ))@ (qz (pronounced (queasy) )
    .little-neck (I did not want that, but RFC1480 required it) .ny (New
    F%@!: York) .us (USA) or ) netusa (Located on Long Island) . net (Elijah)”
    … which are both valid email addresses, by some measures. If you’re writing software to deliver email, or to display it, you really need to be aware of the strict rules and be able to handle any syntactically valid address.
    Email address capture for mailing list signup is a different beast, though. The trick there is to accept any email address which a legitimate user might want to sign up for a mailing list with, while also rejecting entries which are likely to be typos or mistakes, or which are likely to cause data handling or delivery problems at some later point.
    The rules of thumb I give are, I think, fairly close to the sweet spot for that (unless there’s something obvious I’ve forgotten, which is quite possible). They’ll reject an awful lot of syntactically valid “stunt” email addresses – the sort of addresses you have in the QA suite for an MTA – but they’re not addresses real users actually use, and for this purpose that’s a feature rather than a bug.
    A full RFC 2821, or even 2822, email address parser in JavaScript wouldn’t be too difficult – it could be done in a few lines with a single regex and a loop – but it’s not something that’d really be that useful apart from as proof of your ‘leet regex skills. 🙂
    (See http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html for the level of regex ‘leetness you’d need)

  • I agree with Stefano and Steve – it’s almost impossible to automatically check the validity of an email address on submission. You can, however, check for the presence of the ‘@’ character, correct the domain spelling against a local domain table (‘hotmial’ -> ‘hotmail’) – maybe even check for an MX record for the domain.
    However, it is better to employ other verification methods – like a double opt-in, for example. In this case you ensure you get a more engaged subscriber and don’t pollute your sending IP with invalid emails.

By steve
