
A coworker of mine, Andreas Birkebæk, received an email the other day and sent it to me, thinking it was some variation of the Unicode failures that plague me regularly. While it was an encoding failure of sorts, this one was more interesting than your garden-variety UTF-8/ISO-8859-1 problem.

Here's the email, with some of the relevant headers included, and some personal information redacted:

From: notifications@example-sender.com
Subject: Re: Re: Bestyrelsesmøde (Action Requested)
Date: 7 June 2017 at 15.27.13 GMT-6
To: Andreas Birkebæk <andreas@example.com>
Reply-To: notifications@example-sender.com

Hello =?utf-8?Q?Andreas_Birkeb=C3=A6k?=,

Hi there, You just sent me an email. But in case of spam mail I have
chosen to use a filter. I'll be more likely to see your email and
future messages, if you are on my priority Guest List.

From the body of the email we can guess that it originated from some sort of auto-responder/spam sender, which in and of itself doesn't really warrant any additional scrutiny. However, there is that wonderfully cryptic:

Hello =?utf-8?Q?Andreas_Birkeb=C3=A6k?=,

which conveys just enough information to give you an idea of what they were attempting to do ("Hello <name>,"). Well, you'd chuckle too if you had the pleasure of reading through the various email-related RFCs, most notably RFC 822 and RFC 2047, the latter of which is the subject of this discussion.

A Bit of Email RFC History

In the late 1970s and early 1980s, internet message formats (what would later come to be known as "email") and their various transmission protocols were codified into several standards and published as RFCs, which are now maintained by the IETF. The original internet message format, RFC 822, assumed that email would only ever be written using the ASCII character encoding. One could call these early developers and internet protocol pioneers short-sighted for that, but you have to remember that, at the time, the only people communicating via these new-fangled devices and message formats were part of ARPANET, and the entire internet could be drawn on a piece of A4 paper.

Well, it turns out that people want to write emails in more than one language, and RFC 2045 addressed most of that with the addition of a Content-Type header (among other things), so that message bodies could be written in whatever language the author desired.

Using non-ASCII characters in message headers, however, posed a bit of a problem. Most message handling software at the time did weird things to email headers: it reordered addresses in To/CC fields, wrapped headers in arbitrary locations, and was generally unable to correctly parse headers that contained "special" characters like "<" or ",", even though a mechanism for escaping these characters was defined in RFC 822.

Encoded Words for Non-ASCII Email Headers

This brings us to RFC 2047, where:

the techniques outlined here were designed to allow the use of non-ASCII characters in message headers in a way which is unlikely to be disturbed by the quirks of existing Internet mail handling programs.

[…] certain sequences of "ordinary" printable ASCII characters (known as "encoded-words") are reserved for use as encoded data. The syntax of encoded-words is such that they are unlikely to "accidentally" appear as normal text in message headers. Furthermore, the characters used in encoded-words are restricted to those which do not have special meanings in the context in which the encoded-word appears.

Generally, an "encoded-word" is a sequence of printable ASCII characters that begins with "=?", ends with "?=", and has two "?"s in between. It specifies a character set and an encoding method, and also includes the original text encoded as graphic ASCII characters, according to the rules for that encoding method.
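The structure described in that last quoted paragraph can be sketched in a few lines of Python. This is only a minimal illustration of the "=?charset?encoding?encoded-text?=" layout, not a complete RFC 2047 parser: the names ENCODED_WORD, decode_q, and decode_encoded_word are made up for this example, and details such as language suffixes on the charset and the length limit on encoded-words are ignored.

```python
import base64
import re

# encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=")


def decode_q(text):
    """Decode "Q" text: "_" stands for a space, "=XX" is one hex-encoded byte."""
    data = text.replace("_", " ").encode("ascii")
    return re.sub(
        rb"=([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        data,
    )


def decode_encoded_word(word):
    """Split a single encoded-word into its three parts and decode it."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        raise ValueError(f"not an encoded-word: {word!r}")
    charset, encoding, text = match.groups()
    # "B" is base64; "Q" is the quoted-printable-like encoding handled above.
    raw = base64.b64decode(text) if encoding.upper() == "B" else decode_q(text)
    return raw.decode(charset)


name = decode_encoded_word("=?utf-8?Q?Andreas_Birkeb=C3=A6k?=")
# name == "Andreas Birkebæk"
```

In real code the standard library's email package is the better choice; the hand-rolled version is here only to make the three "?"-separated fields visible.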

Looking back at the original email that my coworker received, things start to make a bit more sense. It seems as though this somewhat badly configured, spam-like auto-responder produced an encoded-word that was most likely intended for a header such as Subject:, but inserted it into the message body instead.
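For contrast, here is a rough sketch of how the same name could be placed in a header and in a body using Python's standard email package. I have no idea what software the auto-responder actually runs, so this is purely illustrative; the subject line is shortened from the one in the email above, and the variable names are my own.

```python
from email.headerregistry import Address
from email.message import EmailMessage

name = "Andreas Birkebæk"

msg = EmailMessage()
# Headers: the library turns non-ASCII text into encoded-words on output.
msg["To"] = Address(display_name=name, username="andreas", domain="example.com")
msg["Subject"] = "Re: Bestyrelsesmøde"
# Body: ordinary text, with the charset declared in the Content-Type header.
msg.set_content(f"Hello {name},\n")

raw = msg.as_bytes()
# Everything before the first blank line is the header block.
header_block = raw.split(b"\n\n", 1)[0]
```

The header block stays pure ASCII and carries encoded-words, while the body is labelled as UTF-8 text and needs no encoded-words at all. The failure in the email looks like the header treatment being applied to body text.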

If we note that \xc3\xa6 is the byte sequence for æ in UTF-8, we can almost see that Hello =?utf-8?Q?Andreas_Birkeb=C3=A6k?=, was actually supposed to be Hello Andreas Birkebæk,. A quick check with Python confirms my suspicions:

In [1]: import email

In [2]: from email.header import decode_header

In [3]: result = decode_header(u'=?utf-8?Q?Andreas_Birkeb=C3=A6k?=')

In [4]: print(result)
[('Andreas Birkeb\xc3\xa6k', 'utf-8')]

In [5]: print(result[0][0].decode(result[0][1]))
Andreas Birkebæk
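The session above is from Python 2. On Python 3, decode_header returns the decoded word as a bytes object, so the same check looks slightly different. The sketch below also applies it to the whole greeting line; the helper decode_stray_encoded_words and its regular expression are my own invention for this post, not something the email package provides.

```python
import re
from email.header import decode_header

# On Python 3 the first element of each pair is bytes, the second the charset.
raw, charset = decode_header("=?utf-8?Q?Andreas_Birkeb=C3=A6k?=")[0]
name = raw.decode(charset)

# Loose pattern for anything shaped like an encoded-word in running text.
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[QqBb]\?[^?\s]*\?=")


def decode_stray_encoded_words(text):
    """Replace each encoded-word found in body text with its decoded form."""
    def _decode(match):
        raw, charset = decode_header(match.group(0))[0]
        return raw.decode(charset)
    return ENCODED_WORD.sub(_decode, text)


greeting = decode_stray_encoded_words("Hello =?utf-8?Q?Andreas_Birkeb=C3=A6k?=,")
# greeting == "Hello Andreas Birkebæk,"
```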

Yet another reminder that no matter how hard you try to avoid them, character encoding problems can pop up in the strangest of places.