Please watch your character encodings

I started writing this as a newsgroup post for one of Mozilla’s mailing lists, but it turned out to be too long and since this part was mainly aimed at folks who either didn’t know about or wanted a quick refresher on character encodings I decided to blog it instead. Please let me know if there are errors in here, I am by no means an expert on this stuff either and I do get caught out sometimes!

Text is tricky. Unicode supports the notion of 1,114,112 distinct characters, slightly more than a byte of memory can hold. So to store a character we have to use a way of encoding its value into bytes in memory. A straightforward encoding would just use three bytes per character. But (roughly) the larger the character value the less often it is used, and memory is precious, so often variable length encodings are used. These will use fewer bytes in memory for characters earlier in the range at the cost of using a little more memory for the rarer characters. Common encodings include UTF-8 (one byte for ASCII characters, up to four bytes for other characters) and UTF-16 (two bytes for most characters, four bytes for less used ones).

What does this mean?

It may not be possible to know the number of characters in a string purely by looking at the number of bytes of memory used.

When a string is encoded with a variable length encoding the number of bytes used by a character will vary. If the string is held in a byte buffer just dividing its length by some number will not always return the number of characters in a string. Confusingly many string implementations expose a length property, that often only tells you the number of code points, not the number of characters in a string. I bet most JavaScript developers don’t know that JavaScript suffers from this:

let test = "\u{1F42E}"; // This is the Unicode cow 🐮 (https://emojipedia.org/cow-face/)
test.length; // This returns 2!
test.charAt(0); // This returns "\ud83d"
test.charAt(1); // This returns "\udc2e"
test.substring(0, 1); // This returns "\ud83d"

Fun!

More modern versions of JavaScript do give better options, though they are probably slower than the length property (because it must decode the characters to understand the length:

Array.from(test).length; // This returns 1
test.codePointAt(0).toString(16); // This returns "1f42e"

When you encode a character into memory and pass it to some other code, that code needs to know the encoding so it can decode it correctly. Using the wrong encoder/decoder will lead to incorrect data.

Using the wrong decoder to convert a buffer of memory into characters will often fail. Take the character “ñ”. In UTF-8 this is encoded as C3 B1. Decoding that as UTF-16 will result in “쎱”. In UTF-16 however “ñ” is encoded as 00 F1. Trying to decode that as UTF-8 will fail as that is in invalid UTF-8 sequence.

Many languages thankfully use string types that have fixed encodings, in rust for example the str primitive is UTF-8 encoded. In these languages as long as you stick to the normal string types everything should just work. It isn’t uncommon though to do manipulations based on the byte representation of the characters, %-encoding a string for a URL for example, so knowing the character encoding is still important.

Some languages though have string types where the encoding may not be clear. In Gecko C++ code for example a very common string type in use is the nsCString. It is really just a set of bytes and has no defined encoding and no way of specifying one at the code level. The only way to know for sure what the string is encoded as is to track back to where it was created. If you’re unlucky it gets created in multiple places using different encodings!

Funny story. This blog posts contains a couple of the larger unicode characters. While working on the post I kept coming back to find that the character had been lost somewhere along the way and replaced with a “?”. Seems likely that there is a bug in WordPress that isn’t correctly handling character encodings. I’m not sure yet whether those characters will survive publishing this post!

These problems disproportionately affect non-English speakers.

Pretty much all of the characters that English speakers use (mostly the Latin alphabet) live in the ASCII character set which covers just 128 characters (some of these are control characters). The ASCII characters are very popular and though I can’t find references right now it is likely that the majority of strings used in digital communication are made up of only ASCII characters, particularly when you consider strings that humans don’t generally see. HTTP request and response headers generally only use ASCII characters for example.

Because of this popularity when the Unicode character set was first defined, it mapped the 128 ASCII characters to the first 128 Unicode characters. Also UTF-8 will encode those 128 characters as a single byte, any other characters get encoded as two bytes or more.

The upshot is that if you only ever work with ASCII characters, encoding or decoding as UTF-8 or ASCII yields identical results. Each character will only ever take up one byte in memory so the length of a string will just be the number of bytes used. An English speaking developer, and indeed many other developers may only ever develop and test with ASCII characters and so potentially become blind to the problems above and not notice that they aren’t handling non-ASCII characters correctly.

At Mozilla where we try hard to make Firefox work in all locales we still routinely come across bugs where non-ASCII characters haven’t been handled correctly. Quite often issues stem from a user having non-ASCII characters in their username or filesystem causing breakage if we end up decoding the path incorrectly.

This issue may start getting rarer. With the rise in emoji popularity developers are starting to see and test with more and more characters that encode as more than one byte. Even in UTF-16 many emoji encode to four bytes.

Summary

If you don’t care about non-ASCII characters then you can ignore all this. But if you care about supporting the 80% of the world that use non-ASCII characters then take care when you are doing something with strings. Make sure you are checking its length correctly when needed. If you are working with data structures that don’t have an explicit character encoding then make sure you know what encoding your data is in before doing anything with it other than passing it around.