UTF-8 mojibake – a practical guide to understanding decoding errors

Imagine you have a book before you written in Bokmål (“Book tongue” or literary Norwegian). You are reading this book but there is a catch: You only know Danish. What’s worse is that you don’t even know that other languages exist. In fact, you don’t have the capacity to imagine them. As far as you’re concerned, “Danish” and “language” are one and the same thing.

So what happens when you start reading? A real Danish speaker would recognise that, while the text looks a fair bit like modern Danish – as Bokmål does – it is not, in fact, Danish. They would probably conclude that it was Norwegian, maybe Swedish, and adjust their expectations accordingly. But you’re not a real Danish speaker, you’re the simpleton I just outlined in the first paragraph.

What will happen is that you will read on, assuming Danish, understand some bits correctly – because Danish and Bokmål share some words – and get others horribly wrong and sometimes you won’t even be aware of your mistakes.

This is what typically happens when text goes off and strange, foreign characters assault our eyes: Faulty assumptions about the character encoding employed.

This post is not a thorough grounding in understanding character encodings and character repertoires. There are other articles that do that. This is attempting to be a shortcut from having seen something that looks off to having enough of an understanding to be able to identify the kind of problem you’re facing.

I am going to talk about the errors that one is likely to encounter in the wild. How to recognise them and understand what happened. So a look at symptoms, more than a course in pathology. Think highschool STD horror slideshow, not college medicine lecture. Hopefully more entertaining, too. If you’re into that sort of thing.

Encoding and decoding

First though, I do need to get some terms straight. When we say that text is encoded, we are talking of the conversion from the characters in your text editor or terminal to bytes written on a disk. That conversion requires picking a character encoding, an agreed upon convention of what bytes mean. When those bytes need to be interpreted as text in your program of choice, we decode the bytes, ideally using the same convention.

Encoding

text presented to user as recognisable characters 🠖 bytes on disk

decoding

bytes on disk 🠖 text presented to user as recognisable characters

I will talk about decoding more than encoding because I am focused on errors. When I see text gone awry I assume it to be a decoding error. If, the assumption goes, the decoding had been done with the right character encoding, the text should have been correct. The right character encoding is the same as the one used for encoding. (This is different from blame, to be clear. If the person responsible for encoding picks an obscure, ancient encoding that is not suitable for the purpose or commonly used, they are probably at fault. But the text is still decodeable with the “right” encoding)

If you’re wondering (as I did) why the bytes don’t come with clear labelling about what they are encoded as, well, let’s extend the analogy I started with: Books rarely try to tell you what language they are written in. It used to be given simply by context: Where was the book published? In what library or bookseller did you find it? What is the dominant or literary language in that place?

Let’s say I am building a computer in the US long before the internet was assumed to be an essential part of computing. Yes, I could call an international congress, demand all other computer makers of the world show up, and agree to a system of labelling our respective character encodings (because if we don’t agree on a uniform system, it doesn’t help). Which would take time, money and effort not to speak of precious bits being lost to country codes. Or I could just make my American computer for Americans with my own character encoding optimised for American characters now and let later generations worry about interoperability. (Later on attempts at systematising all the various encodings of the world did happen, but that effort could not retroactively fit their numbers into the encoding itself)

Terminology

There are a lot of character encodings and a lot of ways to reference them and understand them. I think the only prerequisites for following the post is a vague understanding of the number of bytes being used and the concept of code points. Also hexdecimal and binary numbers, basic arithmetic and an appreciation for Dan Brown-like pseudo-mystery. In hindsight, I should have called the post The Æøå Code.

A curated collection of characters is a character repertoire. Like a musical repertoire defines what is available to an orchestra, a character repertoire sets the limits on what you can enter into a text that draws on that repertoire. Each character in a repertoire has a number assigned to it. That number I will refer to as a code point. Any given version of Unicode is a character repertoire. A sub-section of Unicode, like the Basic Multilingual Plane, that hosts probably all the chacarcters used in this post, can be a character repertoire by itself. You can create your own repertoire fro that matter. It’s just an “arbitrary” group of characters that has been included in a set. If you see a character referred to as U+xxxx (where the x’s are hexdecimal characters) you’re looking at a code point reference in the character repertoire Unicode. Character repertoires and code point as concepts aren’t particularly “computery” – you can just as easily imagine them as very orderly type cases.

With unicode it became very important to distinguish character repertoires (and their code points) from character encodings and the bits and bytes that they use (sometimes referred to as code units). Older character repertoires tended to encompass an encoding scheme and so the distinction got blurred. Unicode does not have a single, default encoding so we need the distinction. When encoding Unicode text you have a choice of three encodings: UTF-8, UTF-16, and UTF-32.

At the risk of repeating myself a bit, UTF-8 is an encoding of Unicode; they are not the same thing, even if they are sometimes used as synonyms. Unicode as a character repertoire is a catalogue or compendium. A big, giant list of (in principle) all the symbols in all the world. It is an amazing work, like Tycho Brahe’s observations of planets or von Linné’s classification of the natural world. Like those works, it’s not so much intellectual genius, though, as much as dogged thoroughness and systematization. Characters in Unicode are typically represented with a number, their place in the big table. å has the number E5 in hexadecimal, aka 229 in decimal. It’s important not to confuse this with UTF8. You can refer to å in some languages as “unicode character 229” (e.g. HTML or CSS) but that number is separate from the code units used when encoding the character in UTF-8. A simple way to understanding it could be to remember that the Unicode table is a way of presenting the characters in a way that makes sense to humans – a long list, divided into logical sections – whereas the encoded character (in one of a number of ways to encode Unicode characters) is the most efficient way for machines to store the characters.

Things like latin-1 (aka ISO-8859-1), CP1252 (aka Windows latin-1 aka “ANSI”) and ASCII are all standards that include both a character repertoire and character encoding of said repertoire. Or to put it another way: They are character encodings with a “coverage” that is defined in the encoding standard itself. They all require a single byte to encode a character. In principle ASCII only requires 7 bits, i.e. one bit less than a byte, but on modern systems that just means that the most significant bit on each byte is always set to zero when encoding using ASCII. In these kinds standards, code points, i.e. the character’s index in the repertoire, also tends to be code units, i.e. the byte value written to disk.

If you’ve been wondering why I don’t use the term “character set”, this is it. Like the standards it arose with, it blurs the line between character repertoires and character encodings. It sounds like it’s a synonym for character repertoire but when used e.g. on the internet, it really means character encoding. As Jukkka Korpela writes:

This is confusing beacuse people often understand “set” as “repertoire”. However , character set means a very specific internal representation of character, and for the same repertoire, several different “character sets” can be used. […] It is advisable to avoid the phrase “character set” when possible.
Jukka Korpela, “Unicode explained” (2006), p. 48

A UTF-8 character can require 1, 2, 3 or 4 bytes. If a UTF-8 character only takes up one byte, it’s the same as ASCII. Same character, same byte value (or code unit). That way, ASCII encoded text is perfectly valid UTF-8. This is one of the main selling point of UTF-8 over other attempts at encoding Unicode: It does not demand that people not using anything but [A-Za-z0-9] sacrifice two or more bytes for every character. Only people using other characters have to expend more bytes per character. Sucks to be them, tough titty, etc. For the rest of the Unicode repertoire, UTF-8 relies on a more complex algorithm that we will not get into here.

For the rest of the Unicode repertoire, UTF-8 relies on a more complex algorithm that we will skip by

I am going to use Python as a tool to do both the encoding and decoding because it’s transparent about the encoding and decoding processes but the conclusions apply equally if you are using a text editor, a web browser, etc. In the rest of this post I will look at text encoded as UTF-8 that gets misinterpreted as an older region specific character encoding. There are a thousand ways this can happen but an obvious example would be a modern web page, like this one, displaying in a really old browser.

UTF-8 to Latin-1

Latin-1, also known as ISO-8859-1 or “Western”, is a character encoding where each character is one byte. It was the most commonly used character encoding in Western Europe before UTF-8, and I dunno, maybe still is? So as the transition from one to the other is still under way, mistakes abound about what any given lump of bytes is encoded as. It adds common characters used in Western European languages to the ASCII set, using an eigth bit (ASCII takes up 7 bits) to make room for an additional 128 characters.

Firstly and most importantly: Text encoded or saved as UTF-8 can always be decoded or rendered as if it were Latin-1 encoded. Don’t get me wrong, the decoded text will be incorrect but you should not get outright error messages from your web browser or programming language or text editor or whatever is doing the decoding. The program reading the bytes will display some funky looking text but be none the wiser. A fair amount will also be the right characters but some will be wrong, some may be invisible and some single characters will get turned into multiple characters. I will get to the reason for this a little later.

Here’s the Danish alphabet as an example. First I write it as a Python 3 string and then I ‘save’ it (or encode it) as UTF-8. Python is not actually saving the variable to disk but the process is the same as if it were.

alfabet = "abcdefghijklmnopqrstuvwxyzæøå"
utf8_alfabet = alfabet.encode('utf-8')

If you want to approxo-pronounce the last three, they are a short ‘ay’ (æ), ‘eh’ but with a puckered mouth (ø) and a shortened ‘oh’ (å). They were tacked onto the Danish alphabet in the 1940s after liberation from Germany because apparently inventing our very own letters was a way of sticking it to the Bosh (in a safe way, after they had been defeated militarily). Go figure.

When I ask Python to show me the saved version (utf8_alfabet) this is what I get:

b'abcdefghijklmnopqrstuvwxyz\xc3\xa6\xc3\xb8\xc3\xa5'

The preceding b is Python’s way of telling me that the following is a series of raw bytes, not interpreted text or string. Up until the z all is recognisable. The last three characters (“æøå”) are represented by the hexadecimal value of their code units in UTF-8.

Just to show how it works when we do it right, I am going to ask Python to decode the bytes for me using the correct encoding, i.e. the one I used to encode with:

utf8_alfabet.decode('utf-8')
'abcdefghijklmnopqrstuvwxyzæøå'

Now, let’s dumb down and assume that the bytes we have before us are really Latin-1.

utf8_alfabet.decode('latin-1')
'abcdefghijklmnopqrstuvwxyzÃ¦Ã¸Ã¥'

Now we’re cooking! That looks nice and awful. Of course, the a-z bit is fine. Latin-1 and UTF-8 share common code points for the ASCII characters but they don’t agree on anything else.

While it is useful for any text writer to know that text looking something like this probably went through the UTF-8-to-Latin-1 mistranslation service, I think we have to understand why I am shown those characters in particular? How did æøå become Ã¦Ã¸Ã¥, specifically?

In UTF-8 those three letters are represented by two bytes each. That means that æ has the byte value \xc3\xa6, ø is \xc3\xb8 and å is \xc3\xa5. The \x is Python’s way of saying that the following two characters are code units expressed as hexdecimal values and not e.g. a literal ‘c3’.

That means that when I asked Python to save or encode my æ as UTF-8, it got written as the two hexadecimal values C3 (195, in decimal) and A6 (166, in decimal). UTF-8 knows instinctively that these two values should be read together as the code point for one character without any need for ‘spaces’ to separate it out (ok, not really instinctively but it will suffice as an explanation for our purposes).

However, I am not asking UTF-8, I am asking Latin-1. And Latin-1 sees two bytes, two values, equalling two characters. If you look at the code page layout for Latin-1 and find the row beginning with 192 and pick the fourth column (“192, 193, 194, 195…”) you will find the character Ã, aka “capital A tilde”. Similarly the value 166 in the code page corresponds to ¦ or “broken bar”. That is how æ turned into Ã¦. The other letters follow the same path.

Note that because all the UTF-8 values for the Nordic letters happen to start with the C3 value, the misrepresentations all start with the Ã character. This is just coincidence.

UTF-8 to CP1252 (aka Windows-1252 aka ‘ANSI’)

Like ASCII before it, Latin-1 reserves a lot of space for control characters. Control characters are non-printing characters. Take the tab control character: It tells the program to move to the next tabulator stop. A lot of them gradually fell out of use as people stopped having to give direct instructions to devices like printers and disk drives and punch hole thingies, leaving that sort of thing to drivers and the like.

Some developers saw that as prime real estate, ready for the taking: Swathes of code units just lying there, unused. Manifest destiny and all that.

Some developers included Microsoft. Microsoft wanted Word to have smart quotes, quotes that bend either one way or the other depending on whether it goes before or after text. “Like” “this”. Latin-1 didn’t have that. So they took Latin-1, and dumped 26 characters into space Latin-1 had reserved for control characters, and called it code page 1252 (CP1252). It was the default in all of the Windows 9x’s and has managed to stick around long after.

As a consequence it’s pretty prevalent and sometimes it mistakenly makes it’s way out onto the internet. For some reason it often get’s called ANSI but a persnickety Wikipedia editor is at pains to point out that this is “wrong”.

Enough with the history lesson, let’s get to the point: Mostly this conversion failure looks the same as when decoding UTF-8 as Latin-1 because the code pages are mostly the same. However, there are ways to tell the difference. As any child knows the raison d’être for Unicode is not to facilitate sharing of Sanskrit or combining Latin and Cyrillic characters, it’s so we can have poop emojis everywhere, including in Python source code.

poop = '💩'
utf8_poop = poop.encode('utf-8')

Asking Python to show me the utf8_poop bytes it gives me

b'\xf0\x9f\x92\xa9'

And now I see what those four bytes look like if they are read as Latin-1 and CP1252, respectively:

utf8_poop.decode('latin-1')
'ð\x9f\x92©'
utf8_poop.decode('cp1252')
'ðŸ’©'

As we can see the four bytes are read as four characters in both encodings. In both cases f0 is represented as ð or ‘eth’, the Icelandic soft d character, and a9 is represented as the copyright symbol, ©. But CP1252 also has representations of 9f (Y with diaresis) and 92 (single closing smart quotation mark) where Latin-1 just shows them as code units. The reason is that those byte values were part of the Ole Land Grab of 26 that I mentioned above. So in Latin-1 they represent non-printing control characters and in CP1252 they are blatant attempts at currying favour with mainstream computer consumers, easily wowed by cosmetic schtick.

UTF-8 to ASCII

Something is bound to go wrong if you think the world’s largest character repertoire is really the world’s smallest. I don’t think this happens all that often – for one thing ASCII is not even on the list of encodings that I can force Firefox to use on a page – but I’ll include it for the sake of illustration.

Here is the opening lines of La Marseillaise, the French anthem, as a Python string that get’s encoded as UTF-8. The triple single quotes are a way to write multi-line strings in Python, nothing more.

anthem = '''Allons enfants de la Patrie,
Le jour de gloire est arrivé !
Contre nous de la tyrannie
L'étendard sanglant est levé'''
utf8_anthem = anthem.encode('utf-8')

You can probably spot already that we are going to get into trouble once we get to the accents over the e’s on lines 2 and 4. Sidenote: They’re accents (acute accents, to be exact), not apostrophes, and this should be on the test you take before being allowed to use a keyboard. In byte form you can see that the é’s are encoded as byte value C3 A9:

b"Allons enfants de la Patrie,\nLe jour de gloire est arriv\xc3\xa9 !\nContre nous de la tyrannie\nL'\xc3\xa9tendard sanglant est lev\xc3\xa9"

And when I ask Python to decode those bytes, assuming they are ASCII, it falls on it’s arse:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 56: ordinal not in range(128)

Despite the technical mumbo jumbo this error message is actually pretty clear. It is one particular byte in a particular place in the bytes list that is giving Python trouble. Specifically 0xc3, which is easily recognisable as the first of the two bytes that make up the code point C3 A9. Like Latin-1, ASCII doesn’t do multiple byte characters. One byte, one character. So C3 and A9 must refer to two separate characters. If I’m in doubt it also tells me where to find the troublesome character: At position 56. Since Python always starts counting from zero, this is actually the 57th position which is where we find the first é.

What then is the problem with this byte? Python says ordinal not in range(128). range(128) is Python’s way of saying “the list comprising every natural number from 0 to 127” ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, … 127]) Those numbers are the code units that are used in ASCII. Anything above 127 is not a code unit for the simple reason that you cannot express values of 128 or above using only 7 bits. C3 is a hexadecimal value. In decimal it’s 195. 195 is not found in the list of 0 to 127, so it’s “not in range(128)”. Python is of course correct, that it is a Unicode related error (“UnicodeDecodeError”) though to be honest I’m not sure how it knows.

Basically, what’s going on here is that most other encodings use full bytes, i.e. 8 bits and so each byte can be any of 256 (2⁸) values. ASCII only uses 7 bits so there are only 128 possible values (2⁷).

By default when Python decodes, it uses the setting errors="strict" which means that errors cause it to throw a fit, like we saw. We can ask it to be more accomodating and just replace unknown characters with �, aka the replacement character, like this:

utf8_anthem.decode('ascii', errors="replace")

And so we get this abomination (note that each é is replaced with two error character for the reasons mentioned previously):

Allons enfants de la Patrie,
Le jour de gloire est arriv�� !
Contre nous de la tyrannie
L’��tendard sanglant est lev��
Ceci n’est pas “La Marseillaise”

Mon dieu!

Conclusion

I will try to sum up so as to make an easy-to-use checklist when bug hunting.

Mostly when you decode UTF-8 with Latin-1 or CP1252 you will get something. That something will be the wrong characters and often the wrong number of characters but you will get characters. It can be difficult to tell which decoding has been used (if that information cannot be gleaned from the browser/editor/application), but replacement characters or weirdly shortened words hint at a latin-1 decoding, whereas the presence of smart quotes and doodads would indicate that the text is assumed to be CP1252.

ASCII decoding is special because roughly half of all possible byte values are not permissible ASCII characters. This will either result in errors, missing characters or replacements, depending on your text viewer’s/decoder’s settings. Be on the lookout for the replacement character, especially in browsers as they tend to try their level best to show you something rather than nothing.

Of course, this is not a surefire diagnostic as there are hundreds of encodings and the permutations are practically infinite. Adding to the complexity is that most modern decoders will make educated guesses when a clear labelling or instruction of which character encoding to use is missing. And so this snippet of pure text from a mailing list, ironically discussing encodings, is Cyrillic (or ISO-8859-5) according to Firefox 88 whereas Chromium 90 guesses (correctly) that it’s probably just Latin-1. In terms of the simile I started with, browsers do know that different languages exist but if they aren’t told what language they have before them, they have to guess. And some guess better than others.

In the next post I will look at what happens when we swap the order, i.e. text encoded as old encodings gets mistaken for UTF-8. Harder to spot and harder to understand. Just as much fun, though.