UTF-8 mojibake – a practical guide to understanding decoding errors

Solving the mystery of scrambled text one ЋВЛ� at a 💩å.

Imagine you have a book before you written in Bokmål (“Book tongue” or literary Norwegian). You are reading this book but there is a catch: You only know Danish. What’s worse is that you don’t even know that other languages exist. In fact, you don’t have the capacity to imagine them. As far as you’re concerned, “Danish” and “language” are one and the same thing.

So what happens when you start reading? A real Danish speaker would recognise that, while the text looks a fair bit like modern Danish – as Bokmål does – it is not, in fact, Danish. They would probably conclude that it was Norwegian, maybe Swedish, and adjust their expectations accordingly. But you’re not a real Danish speaker, you’re the simpleton I just outlined in the first paragraph.

What will happen is that you will read on, assuming Danish, understand some bits correctly – because Danish and Bokmål share some words – and get others horribly wrong and sometimes you won’t even be aware of your mistakes.

This is what typically happens when text goes awry and weird, foreign characters assault our eyes: faulty assumptions about the encoding employed.

This post is not a thorough grounding in understanding encodings and character sets. There are other articles that do that. This is attempting to be a shortcut from having seen something that looks off to having enough of an understanding to be able to identify the kind of problem you’re facing.

I am going to talk about the errors that one is likely to encounter in the wild: how to recognise them and understand what happened. So a look at symptoms, more than a course in pathology. Think high-school STD horror slideshow, not college medicine lecture. Hopefully more entertaining, too. If you’re into that sort of thing.

Encoding and decoding

First though, I do need to get some terms straight. When we say that text is encoded, we are talking of the conversion from the characters or glyphs you’re seeing on your screen, in your text editor or terminal, to bytes written on a disk. That conversion requires picking a character encoding, an agreed upon convention of what bytes mean. When those bytes need to be interpreted as text in your program of choice, we decode the bytes, ideally using the same convention.

Encoding

text presented to user as recognisable characters 🠖 bytes on disk

Decoding

bytes on disk 🠖 text presented to user as recognisable characters

I will talk about decoding more than encoding because I am focused on errors. When I see text gone awry I assume it to be a decoding error. If, the assumption goes, the decoding had been done with the right character encoding, the text would have been correct. The right character encoding is the same as the one used for encoding. (This is different from blame, to be clear. If the person responsible for encoding picks an obscure, ancient encoding that is not suitable for the purpose or commonly used, they are probably at fault. But the text is still decodable with the “right” encoding.)

If you’re wondering (as I did) why the bytes don’t come with clear labelling about what they are encoded as, well, let’s extend the analogy I started with: Books rarely try to tell you what language they are written in. It used to be given simply by context: Where was the book published? In what library or bookseller did you find it? What is the dominant or literary language in that place?

HTML lets you declare a charset on every page, embedded in the page itself, but then HTML was born on a world-spanning network whereas the computer as such was not. Let’s say I am building an American computer in Americaland long before the internet was assumed to be an essential part of computing. Yes, I could call an international congress, demand all the other computer makers of the world show up, and agree on a system of labelling our respective character encodings (because if we don’t agree on a uniform system, it doesn’t help). That would take time, money and effort, not to speak of precious bits being lost to country codes. Or I could just make my American computer for Americans with my own character encoding optimised for American glyphs now and let later generations worry about interoperability. (Later on, attempts at systematising all the various encodings of the world did happen, but that effort could not retroactively fit their labels into the encodings themselves.)

Character encodings, quickly

There are a lot of character encodings and a lot of ways to reference them and understand them. I think the only prerequisites for following this post are a vague understanding of the number of bytes being used and the concept of code points. Also hexadecimal and binary numbers, basic arithmetic and an appreciation for Dan Brown-like pseudo-mystery. In hindsight, I should have called the post The Æøå Code.

Code points are numbers that refer to a place in a character encoding’s big table of characters. If the American alphabet were a character encoding, the letter e would have the decimal code point 5 (assuming we start counting from 1). Code points are written in a special notation, so that you can tell they are code points and not just ordinary text; Unicode’s own convention is U+ followed by hexadecimal digits, as in U+00E5 for å. In my Python examples you will instead see an escaped x (\x) preceding two hexadecimal digits: that is Python’s notation for a single raw byte.
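In real encodings the numbers are less convenient than 5, of course. Python’s built-in ord() and chr() let you look up a character’s Unicode code point and go the other way:

```python
# ord() gives a character's code point, chr() turns a code point back
# into a character
assert ord("e") == 101      # 'e' sits at code point 101 in ASCII/Unicode
assert ord("å") == 0xE5     # hexadecimal E5, i.e. 229 in decimal
assert chr(229) == "å"
```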

In the olden days a character encoding could use an arbitrary number of bits: 6, 7, 8, however many you needed. At some point it was standardised on 8 bits, a byte. That way, even if you didn’t know what encoding was in use, at least you knew that these eight bits taken together should be interpreted as a single character. This applies to all the ASCII-descended single-byte character encodings, like Latin-1. UTF-8 builds on that but adds many more code points, so it needs more bytes.

UTF-8 is an encoding of Unicode; they are not the same thing, even if they are sometimes used as synonyms. This distinction is super important, so I will try to go a bit slow here. Unicode is a catalogue or compendium. A big, giant list of (in principle) all the symbols in all the world. It is an amazing work, like Tycho Brahe’s observations of planets or von Linné’s classification of the natural world. Like those works, it’s not so much intellectual genius, though, as much as dogged thoroughness and systematisation. Characters in Unicode are typically represented with a number, their place in the big table. å has the number E5 in hexadecimal, aka 229 in decimal. It’s important not to confuse this with UTF-8. You can refer to å in some languages as “unicode character 229” (e.g. HTML or CSS) but that number is separate from the encoding of the character in UTF-8. A simple way of understanding it could be to remember that the Unicode table is a way of presenting the characters in a way that makes sense to humans – a long list, divided into logical sections – whereas the encoded character (in one of a number of ways to encode Unicode characters) is an efficient way for machines to store the characters.
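To make the distinction concrete, here is å’s code point next to two different encodings of it (a small Python sketch):

```python
# å's place in the Unicode table: U+00E5, i.e. 229 in decimal...
assert ord("å") == 0xE5
# ...but its UTF-8 encoding is a different number entirely: two bytes
assert "å".encode("utf-8") == b"\xc3\xa5"
# Latin-1, for comparison, happens to store the code point directly
assert "å".encode("latin-1") == b"\xe5"
```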

A UTF-8 character can require 1, 2, 3 or 4 bytes. If a UTF-8 character only takes up one byte, it’s the same as ASCII. Same character, same code point. That way, ASCII-encoded text is perfectly valid UTF-8. This is one of the main selling points of UTF-8 over other attempts at encoding Unicode: it does not demand that people using nothing but [A-Za-z0-9] sacrifice two or more bytes for every character. Only people using other glyphs have to expend more bytes per character. Sucks to be them, tough titty, etc.
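A quick sketch of the variable width in Python (the example characters are mine):

```python
# one character, between one and four bytes in UTF-8
for char, nbytes in [("a", 1), ("å", 2), ("€", 3), ("💩", 4)]:
    assert len(char.encode("utf-8")) == nbytes

# and pure ASCII text is byte-for-byte valid UTF-8
assert "hello".encode("ascii") == "hello".encode("utf-8")
```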

I am going to use Python as a tool to do both the encoding and decoding because it’s transparent about the encoding and decoding processes but the conclusions apply equally if you are using a text editor, a web browser, etc. In the rest of this post I will look at text encoded as UTF-8 that gets misinterpreted as an older region specific character encoding. There are a thousand ways this can happen but an obvious example would be a modern web page, like this one, displaying in a really old browser.

UTF-8 to Latin-1

Latin-1, also known as ISO-8859-1 or “Western”, is a character encoding where each character is one byte. It was the most commonly used character encoding in Western Europe before Unicode, and I dunno, maybe still is? So as the transition from one to the other is still under way, mistakes abound about what any given lump of bytes is encoded as. It adds common characters used in Western European languages to the ASCII set, using an eighth bit (ASCII takes up 7 bits) to make room for an additional 128 characters.

Firstly and most importantly: Text encoded or saved as UTF-8 can always be decoded or rendered as if it were Latin-1 encoded. Don’t get me wrong, the decoded text will be incorrect but you should not get outright error messages from your web browser or programming language or text editor or whatever is doing the decoding. The program reading the bytes will display some funky looking text but be none the wiser. A fair amount will also be the right characters but some will be wrong, some may be invisible and some single characters will get turned into multiple characters. I will get to the reason for this a little later.
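The reason Latin-1 never complains is simple: it has a character for every one of the 256 possible byte values. A quick Python demonstration:

```python
# every byte value from 0 to 255 in one bytes object
all_bytes = bytes(range(256))
# Latin-1 assigns a character to each of them, so nothing can fail
decoded = all_bytes.decode("latin-1")
assert len(decoded) == 256    # one character per byte, no complaints
```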

Here’s the Danish alphabet as an example. First I write it as a Python 3 string and then I ‘save’ it (or encode it) as UTF-8. Python is not actually saving the variable to disk but the process is the same as if it were.

alfabet = "abcdefghijklmnopqrstuvwxyzæøå"
utf8_alfabet = alfabet.encode('utf-8')

If you want to approxo-pronounce the last three, they are a short ‘ay’ (æ), ‘eh’ but with a puckered mouth (ø) and a shortened ‘oh’ (å). The å was only officially tacked onto the Danish alphabet in 1948 (æ and ø are far older), shortly after liberation from Germany, because apparently adopting our very own letter was a way of sticking it to the Boche (in a safe way, after they had been defeated militarily). Go figure.

When I ask Python to show me the saved version (utf8_alfabet) this is what I get:

b'abcdefghijklmnopqrstuvwxyz\xc3\xa6\xc3\xb8\xc3\xa5'

The preceding b is Python’s way of telling me that the following is a series of raw bytes, not interpreted text or a string. Up until the z all is recognisable. The last three characters (“æøå”) are represented by their UTF-8 byte sequences, written as hexadecimal escapes.

Just to show how it works when we do it right, I am going to ask Python to decode the bytes for me using the correct encoding, i.e. the one I used to encode with:

utf8_alfabet.decode('utf-8')
'abcdefghijklmnopqrstuvwxyzæøå'

Now, let’s dumb down and assume that the bytes we have before us are really Latin-1.

utf8_alfabet.decode('latin-1')
'abcdefghijklmnopqrstuvwxyzÃ¦Ã¸Ã¥'

Now we’re cooking! That looks nice and awful. Of course, the a-z bit is fine. Latin-1 and UTF-8 share common code points for the ASCII characters but they don’t agree on anything else.

While it is useful for any text writer to know that text looking something like this probably went through the UTF-8-to-Latin-1 mistranslation service, I think we also have to understand why we are shown those characters in particular. How did æøå become Ã¦Ã¸Ã¥, specifically?

In UTF-8 those three letters are represented by two bytes each. We will get into more detail about why and how when we reverse the encoding order. For now that is all you need to know about UTF-8 encoding. It means that æ is written as the bytes \xc3\xa6, ø as \xc3\xb8 and å as \xc3\xa5. The ‘\x’ is Python’s way of saying that the following two characters are a byte value expressed in hexadecimal and not a literal ‘c3’.

That means that when I asked Python to save or encode my æ as UTF-8, it got written as the two hexadecimal values C3 (195, in decimal) and A6 (166, in decimal). UTF-8 knows instinctively that these two values should be read together as a single character, without any need for ‘spaces’ to separate them out (OK, not really instinctively, but we will get to how later).

However, I am not asking UTF-8, I am asking Latin-1. And Latin-1 sees two bytes, two values, equalling two characters. If you look at the code page layout for Latin-1 and find the row beginning with 192 and pick the fourth column (“192, 193, 194, 195…”) you will find the character Ã, aka “capital A tilde”. Similarly the value 166 in the code page corresponds to ¦ or “broken bar”. That is how æ turned into Ã¦. The other letters follow the same path.
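The same two-step can be replayed in Python, one byte at a time:

```python
# the two UTF-8 bytes of æ (iterating a bytes object yields integers)
first, second = "æ".encode("utf-8")
assert (first, second) == (0xC3, 0xA6)       # 195 and 166, in decimal

# read one byte at a time, as Latin-1 insists on doing
assert bytes([first]).decode("latin-1") == "Ã"    # capital A tilde
assert bytes([second]).decode("latin-1") == "¦"   # broken bar
```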

Note that because all the UTF-8 byte sequences for the Nordic letters happen to start with the value C3, the misrepresentations all start with the character Ã. This is no coincidence, either: every code point from U+00C0 through U+00FF gets the lead byte C3 in UTF-8, and the Nordic letters all live in that range.

UTF-8 to CP1252 (aka Windows-1252 aka ‘ANSI’)

Like ASCII before it, Latin-1 reserves a lot of code points for control characters. Control characters are non-printing characters. The most recognisable and easily-understandable today would probably be the tabulator or tab: It tells the program to move to the next tab stop. A lot of them gradually fell out of use as people stopped having to give direct instructions to devices like printers and disk drives and punch hole thingies, leaving that sort of thing to drivers and the like.

Some developers saw that as prime real estate, ready for the taking: Swathes of code points just lying there, unused. Manifest destiny and all that.

Some developers included Microsoft. Microsoft wanted Word to have smart quotes, quotes that bend one way or the other depending on whether they come before or after text. “Like” “this”. Latin-1 didn’t have that. So they took Latin-1, dumped 26-odd characters into the space Latin-1 had reserved for control characters, and called it code page 1252 (CP1252). It was the default in all of the Windows 9x releases and has managed to stick around long after.
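You can see the land grab directly in Python – the smart quotes have byte values in CP1252 but no place at all in strict Latin-1 (a small sketch):

```python
# CP1252 parks the smart quotes in the range Latin-1 reserves for
# control characters
assert "“".encode("cp1252") == b"\x93"
assert "”".encode("cp1252") == b"\x94"

# strict Latin-1 simply has no code point for them
try:
    "“".encode("latin-1")
except UnicodeEncodeError:
    pass  # expected: Latin-1 cannot represent smart quotes
```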

As a consequence it’s pretty prevalent and sometimes it mistakenly makes its way out onto the internet. For some reason it often gets called ANSI, but a persnickety Wikipedia editor is at pains to point out that this is “wrong”.

Enough with the history lesson, let’s get to the point: mostly this conversion failure looks the same as when decoding UTF-8 as Latin-1, because the code pages are mostly the same. However, there are ways to tell the difference. As any child knows, the raison d’être for Unicode is not to facilitate sharing of Sanskrit or combining Latin and Cyrillic characters, it’s so we can have poop emojis everywhere, including in Python source code. Here we go:

poop = '💩'
utf8_poop = poop.encode('utf-8')

When I ask Python to show me the utf8_poop bytes, it gives me

b'\xf0\x9f\x92\xa9'

And now I see what those four bytes look like if they are read as Latin-1 and CP1252, respectively:

utf8_poop.decode('latin-1')
'ð\x9f\x92©'

utf8_poop.decode('cp1252')
'ðŸ’©'

As we can see, the four bytes are read as four characters in both encodings. In both cases F0 is represented as ð or ‘eth’, the Icelandic soft d character, and A9 is represented as the copyright symbol, ©. But CP1252 also has representations of 9F (Y with diaeresis) and 92 (single closing smart quotation mark) where Latin-1 just shows them as escaped byte values. The reason is that those code points are part of the Ole Land Grab o’ 26 that I mentioned above. So in Latin-1 they represent non-printing control characters, and in CP1252 they are blatant attempts at currying favour with mainstream computer consumers, easily wowed by cosmetic schtick.
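There is one more way to tell the two apart, for what it’s worth: five byte values (81, 8D, 8F, 90, 9D) were left undefined in CP1252, so a strict CP1252 decoder – Python’s included – will refuse them, where Latin-1 shrugs and carries on:

```python
# Latin-1 decodes every byte, even the five values CP1252 left undefined
assert b"\x81".decode("latin-1") == "\x81"   # an invisible control character

# CP1252 has nothing at 0x81 and refuses outright
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError:
    pass  # expected: character maps to <undefined>
```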

UTF-8 to ASCII

Something is bound to go wrong if you think the world’s largest character set is really the world’s smallest. I don’t think this happens all that often – for one thing ASCII is not even on the list of encodings that I can force Firefox to use on a page – but I’ll include it for the sake of illustration.

Here are the opening lines of La Marseillaise, the French anthem, as a Python string that gets encoded as UTF-8. The triple single quotes are a way to write multi-line strings in Python, nothing more.

anthem = '''Allons enfants de la Patrie,
Le jour de gloire est arrivé !
Contre nous de la tyrannie
L'étendard sanglant est levé'''

utf8_anthem = anthem.encode('utf-8')

You can probably spot already that we are going to get into trouble once we get to the accents over the e’s on lines 2 and 4. Sidenote: they’re accents (acute accents, to be exact), not apostrophes, and this should be on the test you take before being allowed to use a keyboard. In byte form you can see that the é’s are encoded as the byte sequence C3 A9:

b"Allons enfants de la Patrie,\nLe jour de gloire est arriv\xc3\xa9 !\nContre nous de la tyrannie\nL'\xc3\xa9tendard sanglant est lev\xc3\xa9"

And when I ask Python to decode those bytes, assuming they are ASCII, it falls on its arse:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 56: ordinal not in range(128)

Despite the technical mumbo jumbo this error message is actually pretty clear. It is one particular byte in a particular place in the byte sequence that is giving Python trouble. Specifically 0xc3, which is easily recognisable as the first of the two bytes that make up é’s encoding, C3 A9. Like Latin-1, ASCII doesn’t do multi-byte characters. One byte, one character. So C3 and A9 must refer to two separate characters. If I’m in doubt, it also tells me where to find the troublesome character: at position 56. Since Python always starts counting from zero, this is actually the 57th position, which is where we find the first é.

What then is the problem with this byte? Python says ordinal not in range(128). range(128) is Python’s way of saying “every natural number from 0 to 127” ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, … 127]). Those numbers are the code points that exist in ASCII. Anything above 127 is not a code point, for the simple reason that that is all you can squeeze into 7 bits. C3 is a hexadecimal value. In decimal it’s 195. 195 is not found in the list of 0 to 127, so it’s “not in range(128)”. Python is of course correct that it is a Unicode-related error (“UnicodeDecodeError”): in Python 3 every decode produces a Unicode string, whatever codec is used, so every failed decode is by definition a UnicodeDecodeError.

Basically, what’s going on here is that most other encodings use full bytes, i.e. 8 bits, and so each byte can be any of 256 (2^8) values. ASCII only uses 7 bits, so there are only 128 possible values (2^7).
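In Python that boundary is easy to poke at (a minimal sketch):

```python
# byte values 0-127 are ASCII, 128-255 are not
assert bytes([127]).decode("ascii") == "\x7f"   # the last valid value

try:
    bytes([128]).decode("ascii")
except UnicodeDecodeError:
    pass  # expected: 0x80 is "not in range(128)"
```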

By default when Python decodes, it uses the setting errors="strict", which means that errors cause it to throw a fit, like we saw. We can ask it to be more accommodating and just replace unknown characters with �, aka the replacement character, like this:

utf8_anthem.decode('ascii', errors="replace")

And so we get this abomination (note that each é is replaced with two replacement characters, for the reasons mentioned previously):

Allons enfants de la Patrie,
Le jour de gloire est arriv�� !
Contre nous de la tyrannie
L'��tendard sanglant est lev��

Ceci n’est pas “La Marseillaise”

Mon dieu!
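And "replace" is not the only option, by the way. Two other error handlers Python offers, sketched on a shorter sample of my own:

```python
sample = "Le jour de gloire est arrivé !".encode("utf-8")

# "ignore" silently drops the bytes it cannot make sense of...
assert sample.decode("ascii", errors="ignore") == (
    "Le jour de gloire est arriv !")

# ..."backslashreplace" keeps the offending bytes visible as escapes
assert sample.decode("ascii", errors="backslashreplace") == (
    "Le jour de gloire est arriv\\xc3\\xa9 !")
```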

Conclusion

I will try to sum up so as to make an easy-to-use checklist when bug hunting.

Mostly when you decode UTF-8 with Latin-1 or CP1252 you will get something. That something will be the wrong characters and often the wrong number of characters, but you will get characters. It can be difficult to tell which decoding has been used (if that information cannot be gleaned from the browser/editor/application), but invisible characters or weirdly shortened words hint at a Latin-1 decoding, whereas the presence of smart quotes and doodads would indicate that the text is assumed to be CP1252.
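As a bonus for the checklist: if you can work out which mistranslation happened, you can often reverse it by re-encoding the garbled text with the wrongly assumed encoding and decoding it again with the right one. A minimal sketch, using the mangled Nordic letters from earlier:

```python
# UTF-8 bytes that were mis-decoded as Latin-1
mangled = "Ã¦Ã¸Ã¥"
# undo the damage: encode back to the original bytes, decode correctly
repaired = mangled.encode("latin-1").decode("utf-8")
assert repaired == "æøå"
```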

ASCII decoding is special because roughly half of all possible byte values are not permissible ASCII characters. This will either result in errors, missing characters or replacements, depending on your text viewer’s/decoder’s settings. Be on the lookout for the replacement character, especially in browsers as they tend to try their level best to show you something rather than nothing.

Of course, this is not a surefire diagnostic, as there are hundreds of encodings and the permutations are practically infinite. Adding to the complexity, most modern decoders will make educated guesses when clear labelling of which character encoding to use is missing. And so this snippet of pure text from a mailing list, ironically discussing encodings, is Cyrillic (or ISO-8859-5) according to Firefox 88, whereas Chromium 90 guesses (correctly) that it’s probably just Latin-1. In terms of the simile I started with, browsers do know that different languages exist, but if they aren’t told what language they have before them, they have to guess. And some guess better than others.

In the next post I will look at what happens when we swap the order, i.e. text encoded as old encodings gets mistaken for UTF-8. Harder to spot and harder to understand. Just as much fun, though.
