DKIM Canonicalization – or – why Microsoft breaks your mail

thx1138
One of these things is just like the other

Canonicalization is about comparing things to see if they’re the same. Sometimes you want to do a “fuzzy” comparison, to see if two things are interchangeable for your purposes, even if they’re not exactly identical.
As a concrete example, these two email addresses:

  • (Steve) steve@wordtothewise.com
  • “Also Steve” <steve@WORDTOTHEWISE.COM>

They’re clearly not identical, but they’ll deliver to the same mailbox.
I could compare them with a set of comparison rules (if the string to the left of the @ sign up to white space is the same in both, or one of them has a less than sign in front of it, and the string to the right of the @ is the same, compared case-insensitively, ….).
Or I could canonicalize both email addresses and see if the results are identical. A simple canonicalization algorithm might be “Remove anything in parentheses. Remove any quoted strings. Strip any whitespace and any greater than or less than signs at each end. Convert anything after the @ sign to lower-case”. That’ll give two canonicalized email addresses:

  • steve@wordtothewise.com
  • steve@wordtothewise.com

They’re trivially identical, so I know that the two email addresses I started with are interchangeable.
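To make that concrete, here is a minimal Python sketch of the toy canonicalization described above. The function name canonicalize_address is made up for illustration, and this is deliberately not a real RFC 5322 address parser; it only follows the four rules in the text and will get plenty of legal-but-unusual addresses wrong.

```python
import re

def canonicalize_address(raw: str) -> str:
    """Toy canonicalization following the four rules in the text.

    Hypothetical helper for illustration only; not an RFC 5322 parser.
    """
    # Rule 1: remove anything in parentheses.
    s = re.sub(r"\([^)]*\)", "", raw)
    # Rule 2: remove quoted strings (straight or curly double quotes).
    s = re.sub(r'"[^"]*"|“[^”]*”', "", s)
    # Rule 3: strip whitespace and angle brackets from both ends.
    s = s.strip(" \t<>")
    # Rule 4: lower-case everything after the @ sign.
    local, at, domain = s.partition("@")
    return local + at + domain.lower()

a = canonicalize_address("(Steve) steve@wordtothewise.com")
b = canonicalize_address("“Also Steve” <steve@WORDTOTHEWISE.COM>")
print(a == b)  # the two canonical forms compare equal
```

Note that only the domain is lower-cased. The part before the @ is left alone, because the rules above never said it could be treated case-insensitively.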
DKIM validation is all about comparing whether things have changed or not. The DKIM-Signature header contains a “fingerprint” of the canonicalized message body and of the canonicalized headers from when the mail was sent. If you canonicalize the body and headers that you received, take the fingerprint in the same way and that’s identical to the one in the DKIM-Signature header, you know the headers and body haven’t been modified since the message was signed (and then you can do some DNS lookups and some cryptography to find who signed the message).
Unfortunately, email was never designed to send messages unchanged. Intermediate servers will often “fix up” messages – by folding long lines, normalizing whitespace, adding missing headers or fixing up invalid ones, by re-encoding content to different encodings and all sorts of other changes. That would break a byte-for-byte comparison of the mail as sent and as received. And because we don’t have a copy of the mail as sent to compare with – we only have a “fingerprint” of it – we can’t do any fancy comparison. So we have to rely on the canonicalization, and hope that even after the “fix ups” made during delivery the canonicalized forms – and hence the fingerprints – will be identical.

DKIM canonicalization

DKIM defines two canonicalization algorithms for the body of the message, simple and relaxed.
Simple body canonicalization does very little: it just strips any blank lines at the end of the body. Relaxed body canonicalization strips those blank lines, removes any whitespace at the end of each line, and replaces any run of white space – spaces or tabs – within a line with a single space. This means that any change in the amount or kind of whitespace in the body, such as converting tabs to spaces, won’t affect the relaxed canonicalized body.
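As a rough sketch of the two body algorithms, assuming the body is already a byte string with CRLF line endings: the function names simple_body and relaxed_body are mine, and details such as bare CR or LF characters and the l= length tag are ignored. Treat it as an illustration of the rules in RFC 6376 rather than a verified implementation.

```python
import re

def simple_body(body: bytes) -> bytes:
    """Simple: drop empty lines at the end, leaving a single final CRLF."""
    # An empty body canonicalizes to a lone CRLF under "simple".
    return re.sub(rb"(?:\r\n)+\Z", b"", body) + b"\r\n"

def relaxed_body(body: bytes) -> bytes:
    """Relaxed: also collapse whitespace runs and trim line-end whitespace."""
    lines = body.split(b"\r\n")
    # Collapse each run of spaces/tabs to one space, then trim the line end.
    lines = [re.sub(rb"[ \t]+", b" ", line).rstrip(b" ") for line in lines]
    # Drop empty lines at the end of the body.
    while lines and lines[-1] == b"":
        lines.pop()
    # An empty body stays empty under "relaxed".
    return b"".join(line + b"\r\n" for line in lines)

body = b"Hello \t world  \r\nBye\r\n\r\n\r\n"
print(simple_body(body))   # whitespace inside lines is kept as-is
print(relaxed_body(body))  # whitespace runs collapsed, line ends trimmed
```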
DKIM also defines two canonicalization algorithms for the headers of the message. They’re also called simple and relaxed, despite doing quite different things. (Yes, this is confusing.)
Simple header canonicalization is as simple as you can get. It makes no changes, so the headers must be byte-for-byte identical to match. Relaxed header canonicalization converts all header names to lower case, unfolds headers so each is a single line, replaces any run of white space with a single space character, removes any whitespace around the colon after the header name, and removes any trailing whitespace on each line.
(See section 3.4 of the DKIM spec, RFC 6376, if you want all the details.)
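Relaxed header canonicalization can be sketched in a few lines too. This assumes a single header field passed in as a string with CRLF line endings; relaxed_header is a hypothetical name, and the sketch does not deal with choosing which headers to sign or in what order.

```python
import re

def relaxed_header(field: str) -> str:
    """Relaxed canonicalization of one header field (illustrative sketch)."""
    name, _, value = field.partition(":")
    # Unfold: removing the CRLFs joins continuation lines into one line
    # (this also removes the terminating CRLF, which is re-added below).
    value = value.replace("\r\n", "")
    # Collapse runs of spaces/tabs, then trim whitespace at both ends,
    # which covers both "after the colon" and "end of line".
    value = re.sub(r"[ \t]+", " ", value).strip(" ")
    # Lower-case the header name and drop any whitespace before the colon.
    return name.strip(" \t").lower() + ":" + value + "\r\n"

signed = "Subject: Hello   World\r\n"
received = "SUBJECT:  Hello\r\n World  \r\n"  # re-cased and re-folded in transit
print(relaxed_header(signed) == relaxed_header(received))
```

Under simple header canonicalization those two headers would be compared byte-for-byte and would not match.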
The simple takeaway from this is that simple canonicalization makes DKIM signatures that are broken by even trivial modifications in transit, while relaxed canonicalization makes them more robust.
The really simple takeaway is “use relaxed canonicalization”.

The c= field

The canonicalization you use is recorded in the c= field of the DKIM-Signature header, with the two canonicalization names separated by a slash, header first.
So “c=simple/relaxed” means to use simple canonicalization for the headers and relaxed for the body.
You can also use just a single canonicalization type in the c= field. This does not do what you expect.
“c=simple” is exactly the same as “c=simple/simple”. “c=relaxed” is exactly the same as “c=relaxed/simple”.
Yes, this makes no sense. But it’s what the spec says. If you use just “c=relaxed” you’re using simple canonicalization for the body of the message, and any change to whitespace in the body will break your signature.
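The defaulting rule is easy to get wrong, so here is a small sketch of how a c= value expands. RFC 6376 names the two algorithms simple and relaxed, and says a missing body algorithm defaults to simple. The function parse_c_tag is a made-up helper, not part of any DKIM library.

```python
def parse_c_tag(value: str = "simple/simple") -> tuple[str, str]:
    """Expand a DKIM c= value into (header_algorithm, body_algorithm).

    Hypothetical helper. Per RFC 6376 a missing c= tag means simple/simple,
    and a single name applies to the headers only, with the body
    falling back to simple.
    """
    header, _, body = value.strip().partition("/")
    body = body or "simple"
    for name in (header, body):
        if name not in ("simple", "relaxed"):
            raise ValueError(f"unknown canonicalization: {name!r}")
    return header, body

for c in ("relaxed/relaxed", "relaxed", "simple"):
    print(c, "->", parse_c_tag(c))
```

So a signer configured with just “c=relaxed” ends up with ("relaxed", "simple"), which is exactly the accidental case described above.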

And Microsoft?

Microsoft have a long history of modifying email in transit, often to “fix up” differences between standard Internet email and the expectations of their internal code. This article goes into some of how that breaks DKIM in some cases.
It appears that some paths through outlook.com from MX to inbox are converting tabs in the body of the message into spaces at the moment, while other paths aren’t. If you’re using simple DKIM body canonicalization – either intentionally or accidentally with “c=relaxed” – that means you’ll see apparently random DKIM failures for mail sent to recipients hosted by outlook.com, but only for messages where the body is susceptible to whitespace damage.
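You can see the effect on the body hash (the bh= value in the DKIM-Signature header) with a short simulation. This is a sketch under assumptions: the sample message is invented, the tab-to-space rewrite is simulated with a plain replace, SHA-256 is assumed as the hash, and canon_body and body_hash are hypothetical helper names.

```python
import base64
import hashlib
import re

def canon_body(body: bytes, mode: str) -> bytes:
    """Compact simple/relaxed body canonicalization (CRLF input assumed)."""
    lines = body.split(b"\r\n")
    if mode == "relaxed":
        lines = [re.sub(rb"[ \t]+", b" ", ln).rstrip(b" ") for ln in lines]
    while lines and lines[-1] == b"":
        lines.pop()  # ignore empty lines at the end of the body
    out = b"".join(ln + b"\r\n" for ln in lines)
    # An empty body is a lone CRLF under simple, empty under relaxed.
    return out if (out or mode == "relaxed") else b"\r\n"

def body_hash(body: bytes, mode: str) -> str:
    """Base64 SHA-256 of the canonicalized body, as carried in bh=."""
    digest = hashlib.sha256(canon_body(body, mode)).digest()
    return base64.b64encode(digest).decode("ascii")

sent = b"Item:\tWidget\r\nPrice:\t$5\r\n"
received = sent.replace(b"\t", b" ")  # simulated tab-to-space rewrite

for mode in ("simple", "relaxed"):
    ok = body_hash(sent, mode) == body_hash(received, mode)
    print(f"{mode} body canonicalization: {'pass' if ok else 'FAIL'}")
```

A body with no tabs in it hashes the same either way, which is why the failures look random from the sender’s side.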
Using the right canonicalization (“c=relaxed/relaxed” unless you have a good reason not to) is a good start, but you should also look at making the content you send as clean as possible: beyond just complying with the email standards, avoid structures that risk being rewritten in transit. But that’s another post.
