A coworker of mine, Andreas Birkebæk, received an email the other day and sent it to me, thinking that it was some variation of the Unicode failures that plague me regularly. While it was an encoding failure of sorts, this one was more interesting than your garden-variety UTF-8/ISO-8859-1 problem.
Here’s the email, with some of the relevant headers included, and some personal information redacted:
From the body of the email we can guess that it originated from some sort of auto-responder/spam sender, which in and of itself doesn’t really warrant any additional scrutiny. However, there is that wonderfully cryptic:
Hello =?utf-8?Q?Andreas_Birkeb=C3=A6k?=,
which conveys just enough information to give you an idea of what they were
attempting to do (“Hello
A Bit of Email RFC History
In the late 1970s and early 1980s, internet message formats (what would later come to be known as “email”) and their various transmission protocols were codified into several standards, which were later collected and are now maintained by the IETF as RFCs. The original internet message format, RFC822, assumed that email would only ever be written using the ASCII character encoding. One could call these early developers and internet protocol pioneers short-sighted for that, but you have to remember that, at the time, the only people communicating via these new-fangled devices and message formats were part of ARPANET, and the entire internet could be drawn on a piece of A4 paper.
Well, it turns out that people want to write emails in more than one language, and RFC2045 addressed most of that with the addition of a Content-Type header (among other things) so that mail message bodies could be written in whatever language the author desired. Using non-ASCII characters in message headers, however, posed a bit of a problem, mostly because most message handling software at the time did weird things to email headers: reordering addresses in To/CC fields, wrapping headers in arbitrary locations, and generally failing to correctly parse headers that contained “special” characters, even though a mechanism for escaping these characters was defined in RFC822.
Encoded Words for Non-ASCII Email Headers
This brings us to RFC2047, where:
the techniques outlined here were designed to allow the use of non-ASCII characters in message headers in a way which is unlikely to be disturbed by the quirks of existing Internet mail handling programs.
[…] certain sequences of “ordinary” printable ASCII characters (known as “encoded-words”) are reserved for use as encoded data. The syntax of encoded-words is such that they are unlikely to “accidentally” appear as normal text in message headers. Furthermore, the characters used in encoded-words are restricted to those which do not have special meanings in the context in which the encoded-word appears.
Generally, an “encoded-word” is a sequence of printable ASCII characters that begins with “=?”, ends with “?=”, and has two “?”s in between. It specifies a character set and an encoding method, and also includes the original text encoded as graphic ASCII characters, according to the rules for that encoding method.
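To make that syntax concrete, here is a minimal sketch in Python that pulls a single encoded-word apart by hand. The helper name decode_encoded_word and the regular expression are illustrative assumptions, not something defined by RFC2047 or by any library, and the sketch ignores details such as length limits and language tags.

```python
import base64
import quopri
import re

# Shape of an encoded-word: =?charset?encoding?encoded-text?=
# (hypothetical pattern, written for illustration only)
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def decode_encoded_word(word: str) -> str:
    """Decode one encoded-word into text (illustrative helper)."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        raise ValueError(f"not an encoded-word: {word!r}")
    charset, encoding, text = match.groups()
    if encoding.upper() == "B":
        # "B" encoding is plain base64.
        raw = base64.b64decode(text)
    else:
        # "Q" encoding resembles quoted-printable: "=XX" is a hex-escaped
        # byte, and header=True makes "_" decode to a space.
        raw = quopri.decodestring(text.encode("ascii"), header=True)
    return raw.decode(charset)


print(decode_encoded_word("=?utf-8?Q?Andreas_Birkeb=C3=A6k?="))
```

The three “?”-separated fields map directly onto the description above: the character set (utf-8), the encoding method (Q), and the encoded text itself.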
Looking back at the original email that my coworker received, things start to make a bit more sense. It seems as though this somewhat badly configured spam-like auto-responder took the encoded-word format, which was most likely intended for the Subject: header, and used it in the message body instead.
If we note that \xc3\xa6 is the byte sequence for æ in UTF-8, we can almost certainly say that “Hello =?utf-8?Q?Andreas_Birkeb=C3=A6k?=,” was actually supposed to be “Hello Andreas Birkebæk,”. A quick check with Python confirms my suspicions:
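A minimal sketch of such a check, using decode_header from the standard library's email.header module (an assumption on my part; the check could just as well be done another way). The variable names are made up for illustration.

```python
from email.header import decode_header

# The two bytes that follow "Birkeb" in the encoded-word are UTF-8 for "æ".
assert b"\xc3\xa6".decode("utf-8") == "æ"

# The line as it appeared in the message body.
body_line = "Hello =?utf-8?Q?Andreas_Birkeb=C3=A6k?=,"

# decode_header splits the string into (chunk, charset) pairs; chunks that
# were not encoded come back with a charset of None.
parts = decode_header(body_line)

decoded = "".join(
    chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
    for chunk, charset in parts
)
print(decoded)
```

Joining the chunks by hand, rather than going through make_header, keeps the spacing around the decoded name under our control.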
Yet another reminder that no matter how hard you try to avoid them, character encoding problems can pop up in the strangest of places.