Well, not really.
My name is, in fact, Joël. You’d never guess that, however, from the consistent and diverse ways in which my non-ASCII given name is butchered by web applications, email servers, databases, and overzealous baristas who seem to believe I am the father of Superman1.
Most of my name-based problems are rooted in the incorrect application of character encodings at various points within the data input and output pipelines of modern web apps.
That’s probably not what’s going on with the baristas, but I can’t be certain.
The Many Faces of Mojibake2
Not a week goes by without the inevitable email that mangles my first name in much the same way that a chimpanzee of mediocre intelligence might bash out a haiku on a broken typewriter.
Here are the most recent examples that I could come up with, after a few minutes of searching my inbox and my photo library:
You’d think that, as a database company whose product only supports UTF8 strings, they would get a simple diacritic on an ASCII character right.
And Twitter, whose users span the globe and use its platform in dozens of languages, should definitely have gotten my name right in a marketing email.
Pass the Table
Pass The Table, a somewhat underground Toronto-only mobile application for the discerning foodie, seems to just give up after encountering a character it wasn’t expecting. Guess the creators didn’t count on anyone with a non-Anglo name signing up!
Foodora, one of those 3rd party restaurant delivery services that are all the rage, plops in the standard Unicode replacement character instead of the oh-so-difficult ë.
AirBnB seems to think that I’m… Czech? I guess?
Hudson’s Bay, a 340+ year old company that started out as a fur-trading organization and is now a very large conglomerate of department stores, operates in both English and French in Canada. Mostly in English, as I’m sure you can tell.
Of all places where I’d hope that encoding problems wouldn’t show up, airport check-in kiosks make the top of the list. This was years ago, thankfully, and Air Canada has since improved their system.
Pharmaprix is the Québécois-branded French version of a pharmacy chain known as Shoppers Drug Mart in the rest of Canada. If you’re not familiar with French you may not necessarily see the problem, but “Vous avez ÚconomisÚ” should read “Vous avez économisé”: the accented é has been replaced with an uppercase accented Ú. And then when you read a bit more closely, you realize that every single accented character in the footer of the receipt is wrong as well, and you wonder how a receipt-generation system can fail so badly.
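My best guess for the receipt (and it is only a guess) is that the text was stored as Latin1 but printed by a device configured for IBM code page 850, an encoding that is still common in point-of-sale printers. In Latin1, é is byte 0xE9; in CP850, that very same byte means Ú. A quick sketch in Python:

```python
# Hypothesis only: Latin1-encoded text sent to a receipt printer
# that interprets bytes as IBM code page 850 (common in POS hardware).
stored = "Vous avez économisé".encode("latin-1")  # é -> byte 0xE9
printed = stored.decode("cp850")                  # byte 0xE9 -> Ú in CP850
print(printed)  # Vous avez ÚconomisÚ
```

Every ASCII character survives the trip unchanged, which is exactly why only the accented letters on the receipt are wrong.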
Reverse Engineering the Most Common Encoding Failures
Most of the time, I can guess at what happened when things go wrong. One of the easiest ways to confirm my suspicions is with the iconv tool, which ships with any operating system built on glibc3. Here are the most common scenarios:
- Storing data as Latin1/ISO8859-1, and extracting it out as UTF8:
The cause of this encoding failure is, generally, inattention. Many database systems allow client connections to have a different character encoding than what the server is configured with. If you attempt to read Latin1/ISO8859-1 encoded data on the server but have a client that is configured to read the stream of bytes as UTF8, you get mojibake: any Latin1 byte above 0x7F is typically not a valid UTF8 sequence on its own, so it comes out as a decoding error or a replacement character.
Believe it or not, this is incredibly common, and many developers don't realize there's a problem until they attempt to integrate a 3rd party tool to interact with their database, or when they export some data to be used in an external program such as a campaign-based email tool like MailChimp or Campaign Monitor4. It's subtle because, as we'll see below, ASCII characters will remain unchanged, but anything out of that character set will be susceptible to some mojibake.
- Storing data as UTF8, and extracting it out as Latin1:
This is the opposite of the previous example, and a bit less common: you can see it in the Foodora example that was shown earlier. What most likely happened there was that their data was stored as UTF8 in their database, but some export tool they used defaulted to Latin1, and they then loaded up this botched data into their 3rd party campaign emailer system.
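Both failure modes are easy to reproduce. Here is a minimal Python sketch of the two round trips described above:

```python
name = "Joël"

# Scenario 1: bytes stored as Latin1, read back as UTF8.
# 0xEB (ë in Latin1) is not a valid UTF8 sequence here, so the
# reader either raises an error or substitutes U+FFFD.
latin1_bytes = name.encode("latin-1")                  # b'Jo\xebl'
print(latin1_bytes.decode("utf-8", errors="replace"))  # Jo�l

# Scenario 2: bytes stored as UTF8, read back as Latin1.
# ë is two bytes in UTF8 (0xC3 0xAB), and Latin1 happily maps
# each byte to its own character.
utf8_bytes = name.encode("utf-8")                      # b'Jo\xc3\xabl'
print(utf8_bytes.decode("latin-1"))                    # JoÃ«l
```

You can get the same effect with iconv by deliberately lying about the source encoding of a file, which is exactly what a misconfigured client connection does.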
The Often Forgotten Fragility of Character Encodings
Due to the nature of data storage representations in software/hardware systems, all printable (and many non-printable) characters need to be represented by an encoding. Many of you will already be familiar with this concept, but if you are not, then the simplest analogy is Morse code: each of the 26 letters in the Latin/Roman alphabet is assigned a sequence of dots and/or dashes that uniquely5 identifies it within that particular system of representation.
In computing, the idea is similar: once we have agreed upon an injective mapping between characters and numerical values, we in turn represent said characters as a combination of one or more bytes that can be stored on modern hardware. As long as we have the mapping handy, we’re able to easily reconstruct the original from the sequence of bits that it ended up being stored as.
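As a concrete illustration (my sketch, using ë from my own name): the agreed-upon number for ë is Unicode code point U+00EB, but the bytes that actually land on disk depend entirely on which encoding was chosen.

```python
ch = "ë"
print(hex(ord(ch)))           # 0xeb -- the agreed-upon code point
print(ch.encode("latin-1"))   # b'\xeb'      -- one byte in Latin1
print(ch.encode("utf-8"))     # b'\xc3\xab'  -- two bytes in UTF8

# Reconstructing the character requires knowing which mapping was used:
print(b"\xc3\xab".decode("utf-8"))  # ë
```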
Well, it’s only easy if everyone agrees on the same encoding. The great thing about standards is that there are always a plethora to choose from. Due to a variety of historical twists and turns, it turns out that most people agreed on how to represent ASCII characters6 in almost all encodings that people care about, but that’s about where the consensus ends.
If you’re interested in that history, “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”, by Joel Spolsky, is a great read. It only scratches the surface, however, of just how convoluted and difficult getting character encodings right can actually be.
If the entirety of your content resides within the usual ASCII character set, then mixing different character encodings doesn’t typically cause problems: nearly all modern encoding standards use the same code points for the Roman alphabet. This is why many developers in North America don’t notice there’s a problem until well into production, at which point fixing it retroactively is much, much more painful than if it had been caught earlier on in the development cycle.
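This is easy to verify for yourself: a plain-ASCII string encodes to the exact same bytes under Latin1 and UTF8, so a mismatched pipeline looks perfectly healthy until the first accented character arrives.

```python
# Pure ASCII: identical bytes under both encodings, so a
# misconfigured pipeline appears to work just fine.
assert "Joel".encode("latin-1") == "Joel".encode("utf-8")

# Add one diacritic and the byte sequences diverge,
# which is the moment mojibake becomes possible.
assert "Joël".encode("latin-1") != "Joël".encode("utf-8")
```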
Use a Unicode Canary
One of the ways that I (and Fictive Kin) attempt to mitigate this character encoding rot is by ensuring that our QA process and our test suites utilize non-ASCII characters in as many places as possible. If you accept input from a user, chances are that at some point, someone will put non-ASCII characters in there (and most likely not maliciously, either!), so you might as well do that from the get-go yourself.
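A minimal sketch of the idea in Python (the fixture names and the round_trip helper are my own illustration, not Fictive Kin’s actual suite): seed your test fixtures with names that exercise diacritics and non-Latin scripts, and assert that they survive a trip through your storage layer.

```python
# Hypothetical canary fixtures; any non-ASCII names will do.
CANARIES = ["Joël", "Zoë", "André", "渋谷", "Ωμέγα"]

def round_trip(value: str, encoding: str) -> str:
    """Stand-in for a storage layer: encode on write, decode on read."""
    return value.encode(encoding).decode(encoding)

for name in CANARIES:
    # Passes when the whole pipeline speaks UTF8 end to end...
    assert round_trip(name, "utf-8") == name

# ...while a Latin1-only layer fails loudly, in your test suite,
# on the first character it cannot represent.
try:
    round_trip("渋谷", "latin-1")
except UnicodeEncodeError:
    print("canary caught the misconfigured encoding")
```

The point of the canary is to move the failure from a customer’s inbox into your CI run, where it is cheap to fix.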
One of the most wonderful things about the web is that it is truly international. Don’t mess it up by forcing your users to read mojibake, when all they really wanted was to use their real name when signing up for your application.
1. Jor-El is the biological father of Superman in the world of DC Comics, and I have on multiple occasions received cups with his name inscribed upon the side instead of my own.
2. Mojibake
3. Which means pretty much every Linux distribution in existence, and most *BSDs.
4. Note that MailChimp, Campaign Monitor, etc. are not at fault here; they simply act upon the data that they are provided with.
5. There are, in fact, several competing Morse code systems where letters are encoded differently. Just like character encoding systems in modern computing!
6. The set of characters on a standard US English keyboard, plus some non-printing/control characters that you may never have heard of because you were born after the 1960s.