For many years Americans have exchanged text using the ASCII character set; since essentially all U.S. systems support ASCII, this permits easy exchange of English text. Unfortunately, ASCII is completely inadequate in handling the characters of nearly all other languages. For many years different countries have adopted different techniques for exchanging text in different languages, making it difficult to exchange data in an increasingly interconnected world.
More recently, ISO has developed ISO 10646, the “Universal Multiple-Octet Coded Character Set (UCS)”. UCS is a coded character set which defines a single 31-bit value for each of all of the world’s characters. The first 65536 characters of the UCS (which thus fit into 16 bits) are termed the “Basic Multilingual Plane” (BMP), and the BMP is intended to cover nearly all of today’s spoken languages. The Unicode Consortium develops the Unicode standard, which concentrates on the UCS and adds some additional conventions to aid interoperability. Historically, Unicode and ISO 10646 were developed by competing groups, but thankfully they realized that they needed to work together and they now coordinate with each other.
If you’re writing new software that handles internationalized characters, you should be using ISO 10646/Unicode as your basis for handling international characters. However, you may need to process older documents in various older (language-specific) character sets, in which case, you need to ensure that an untrusted user cannot control the setting of another document’s character set (since this would significantly affect the document’s interpretation).
Most software is not designed to handle 16 bit or 32 bit characters, yet to create a universal character set more than 8 bits was required. Therefore, a special format called UTF-8 was developed to encode these potentially international characters in a format more easily handled by existing programs and libraries. UTF-8 is defined, among other places, in IETF RFC 3629 (updating RFC 2279), so it’s a well-defined standard that can be freely read and used. UTF-8 is a variable-width encoding; characters numbered 0 to 0x7f (127) encode to themselves as a single byte, while characters with larger values are encoded into 2 to 4 (originally 6) bytes of information (depending on their value). The encoding has been specially designed to have the following nice properties (this information is from the RFC and Linux utf-8 man page):
The classical US ASCII characters (0 to 0x7f) encode as themselves, so files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. This is fabulous for backward compatibility with the many existing U.S. programs and data files.
All UCS characters beyond 0x7f are encoded as a multibyte sequence consisting only of bytes in the range 0x80 to 0xfd. This means that no ASCII byte can appear as part of another character. Many other encodings permit bytes such as an embedded NUL inside a multibyte character, causing programs to fail.
It’s easy to convert between UTF-8 and 2-byte or 4-byte fixed-width representations of characters (these are called UCS-2 and UCS-4 respectively).
The lexicographic sorting order of UCS-4 strings is preserved, and the Boyer-Moore fast search algorithm can be used directly with UTF-8 data.
All possible 2^31 UCS codes can be encoded using UTF-8.
The first byte of a multibyte sequence which represents a single non-ASCII UCS character is always in the range 0xc0 to 0xfd and indicates how long this multibyte sequence is. All further bytes in a multibyte sequence are in the range 0x80 to 0xbf. This allows easy resynchronization; if a byte is missing, it’s easy to skip forward to the “next” character, and it’s always easy to skip forward and back to the “next” or “preceding” character.
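The properties listed above can be illustrated with a minimal Python sketch. The function names `utf8_sequence_length` and `next_char_start` are hypothetical helpers invented for this illustration, and the lead-byte ranges used are the narrower ones from the current (RFC 3629) definition rather than the original 0xc0 to 0xfd range, so treat it as a sketch under those assumptions rather than a reference implementation.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by a lead byte, or 0 if the
    byte cannot start a sequence under the RFC 3629 definition."""
    if first_byte <= 0x7F:
        return 1              # ASCII encodes as itself
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    return 0                  # continuation byte (0x80-0xBF) or illegal lead


def next_char_start(data: bytes, pos: int) -> int:
    """Resynchronize: move forward from pos to the start of the next
    character by stepping over continuation bytes (0x80 to 0xBF)."""
    pos += 1
    while pos < len(data) and 0x80 <= data[pos] <= 0xBF:
        pos += 1
    return pos


# 'a', e with acute accent (U+00E9), and the euro sign (U+20AC)
data = "a\u00e9\u20ac".encode("utf-8")
lengths = [utf8_sequence_length(data[i]) for i in (0, 1, 3)]
```

Here `data` holds six bytes, and `lengths` reports one, two, and three bytes for the three characters. Starting `next_char_start` in the middle of the euro sign lands on the end of the data, which is the resynchronization behavior the last property describes.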
In short, the UTF-8 transformation format is becoming a dominant method for exchanging international text information because it can support all of the world’s languages, yet it is backward compatible with U.S. ASCII files as well as having other nice properties. For many purposes I recommend its use, particularly when storing data in a “text” file.
The reason to mention UTF-8 is that some byte sequences are not legal UTF-8, and this might be an exploitable security hole. UTF-8 encoders are supposed to use the “shortest possible” encoding, but naive decoders may accept encodings that are longer than necessary. Indeed, earlier standards permitted decoders to accept “non-shortest form” encodings. The problem here is that this means that potentially dangerous input could be represented multiple ways, and thus might defeat the security routines checking for dangerous inputs. The RFC describes the problem this way:
Implementers of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.
A particularly subtle form of this attack could be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but allow the illegal two-octet sequence C0 80 (illegal because it’s longer than necessary) and interpret it as a NUL character (00). Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F.
A longer discussion about this is available at Markus Kuhn’s UTF-8 and Unicode FAQ for Unix/Linux at http://www.cl.cam.ac.uk/~mgk25/unicode.html.
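To make the RFC's second example concrete, here is a small Python sketch. The function `naive_decode` is a deliberately flawed decoder written only for this illustration (it is not taken from any real library); it accepts overlong two-byte forms, which is exactly the mistake the RFC warns about. Python's built-in strict UTF-8 codec is used as the contrast.

```python
def naive_decode(data: bytes) -> str:
    """Deliberately flawed decoder: handles 1- and 2-byte sequences but
    never checks that a 2-byte sequence is the shortest possible form."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data):
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("unsupported byte in this sketch")
    return "".join(out)


def is_strict_utf8(data: bytes) -> bool:
    """Use Python's built-in codec, which rejects overlong sequences."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The illegal sequence from the RFC: 2F C0 AE 2E 2F
attack = bytes([0x2F, 0xC0, 0xAE, 0x2E, 0x2F])

passes_byte_filter = b"/../" not in attack      # the filter sees nothing wrong
decoded_by_naive_parser = naive_decode(attack)  # but this yields "/../"
accepted_by_strict_codec = is_strict_utf8(attack)
```

A check on the raw bytes finds no "/../", the flawed decoder then produces exactly that string, and the strict decoder refuses the input. The safe order is therefore to validate (or strictly decode) first and filter afterwards.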
Thus, when accepting UTF-8 input, you need to check if the input is valid UTF-8. Here is a list of all legal UTF-8 sequences; any character sequence not matching this table is not a legal UTF-8 sequence. This list is from The Unicode Standard Version 7.0 - Core Specification (2014). In the following table, the first column shows the various character values being encoded into UTF-8. The second column shows how those characters are encoded as binary values; an “x” indicates where the data is placed (either a 0 or 1), though some values should not be allowed because they’re not the shortest possible encoding. The last column shows the valid values each byte can have (in hexadecimal). Thus, a program should check that every character meets one of the patterns in the right-hand column. A “-” indicates a range of legal values (inclusive). Of course, just because a sequence is a legal UTF-8 sequence doesn’t mean that you should accept it (you still need to do all your other checking), but generally you should check any UTF-8 data for UTF-8 legality before performing other checks.
Table 5-1. Legal UTF-8 Sequences
UCS Code (Hex) | Binary UTF-8 Format | Legal UTF-8 Values (Hex) |
---|---|---|
00-7F | 0xxxxxxx | 00-7F |
80-7FF | 110xxxxx 10xxxxxx | C2-DF 80-BF |
800-FFF | 1110xxxx 10xxxxxx 10xxxxxx | E0 A0-BF 80-BF |
1000-CFFF | 1110xxxx 10xxxxxx 10xxxxxx | E1-EC 80-BF 80-BF |
D000-D7FF | 1110xxxx 10xxxxxx 10xxxxxx | ED 80-9F 80-BF |
E000-FFFF | 1110xxxx 10xxxxxx 10xxxxxx | EE-EF 80-BF 80-BF |
10000-3FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F0 90-BF 80-BF 80-BF |
40000-FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F1-F3 80-BF 80-BF 80-BF |
100000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | F4 80-8F 80-BF 80-BF |
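The table above can be turned into a checker fairly directly. The following Python sketch is one possible transcription; the names `_ROWS` and `is_legal_utf8` are hypothetical, chosen for this illustration. Each tuple mirrors one multibyte row of the right-hand column: the lead-byte range, the allowed range for the second byte, and the total sequence length. Any third and fourth bytes must be in 80-BF.

```python
# (lead_lo, lead_hi, second_lo, second_hi, total_length), one per table row
_ROWS = [
    (0xC2, 0xDF, 0x80, 0xBF, 2),
    (0xE0, 0xE0, 0xA0, 0xBF, 3),
    (0xE1, 0xEC, 0x80, 0xBF, 3),
    (0xED, 0xED, 0x80, 0x9F, 3),
    (0xEE, 0xEF, 0x80, 0xBF, 3),
    (0xF0, 0xF0, 0x90, 0xBF, 4),
    (0xF1, 0xF3, 0x80, 0xBF, 4),
    (0xF4, 0xF4, 0x80, 0x8F, 4),
]


def is_legal_utf8(data: bytes) -> bool:
    """Return True only if every sequence in data matches a table row."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                      # first row: 00-7F
            i += 1
            continue
        for lead_lo, lead_hi, lo2, hi2, length in _ROWS:
            if lead_lo <= b <= lead_hi:
                break
        else:
            return False                   # C0, C1, F5-FF, or stray 80-BF
        if i + length > n:
            return False                   # sequence is truncated
        if not lo2 <= data[i + 1] <= hi2:
            return False                   # overlong, surrogate, or too large
        for j in range(i + 2, i + length):
            if not 0x80 <= data[j] <= 0xBF:
                return False
        i += length
    return True
```

For example, `is_legal_utf8(b"\xc0\x80")` and `is_legal_utf8(b"\xed\xa0\x80")` are both false (an overlong NUL and an encoded surrogate), while the encoding of any character from U+0000 to U+10FFFF outside the surrogate range is accepted. In practice a well-tested library routine is preferable to hand-written code like this.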
As I noted earlier, there are two standards for character sets, ISO 10646 and Unicode, which have agreed to synchronize their character assignments. The earlier definitions of UTF-8 in ISO/IEC 10646-1:2000 and the IETF RFC also supported five and six byte sequences to encode characters beyond U+10FFFF, but such values can’t be used to support Unicode characters. IETF RFC 3629 modified the UTF-8 definition, and one of the changes was to specifically make any encodings beyond 4 bytes illegal (i.e., characters must be between U+0000 and U+10FFFF inclusively). Thus, the five and six byte UTF-8 encodings for characters beyond U+10FFFF aren’t legal any more, and you should normally reject them (unless you have a special purpose for them).
This set of valid values is tricky to determine, and in fact earlier versions of this document got some entries wrong (in some cases it permitted overlong characters). Language developers should include a function in their libraries to check for valid UTF-8 values, just because it’s so hard to get right.
I should note that in some cases, you might want to cut slack (or use internally) the hexadecimal sequence C0 80. This is an overlong sequence that, if permitted, can represent ASCII NUL (NIL). Since C and C++ have trouble including a NIL character in an ordinary string, some people have taken to using this sequence when they want to represent NIL as part of the data stream; Java even enshrines the practice. Feel free to use C0 80 internally while processing data, but technically you really should translate this back to 00 before saving the data. Depending on your needs, you might decide to be “sloppy” and accept C0 80 as input in a UTF-8 data stream. If it doesn’t harm security, it’s probably a good practice to accept this sequence since accepting it aids interoperability.
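If you do decide to tolerate C0 80 on input, one way to do it is to translate that pair back to 00 and only then apply strict validation. The sketch below assumes that policy; `normalize_modified_nul` and `load_lenient` are hypothetical names for this illustration. The byte-level replacement is safe in this narrow sense: the byte C0 never occurs in legal UTF-8, so the pair cannot be part of some other legal character.

```python
def normalize_modified_nul(data: bytes) -> bytes:
    """Translate the overlong pair C0 80 back to a single 00 byte."""
    return data.replace(b"\xc0\x80", b"\x00")


def load_lenient(data: bytes) -> str:
    """Accept C0 80 as NUL, then decode strictly so that every other
    illegal sequence is still rejected with UnicodeDecodeError."""
    return normalize_modified_nul(data).decode("utf-8")


text = load_lenient(b"a\xc0\x80b")   # 'a', NUL, 'b'
```

Other overlong forms, such as C0 AE, still raise `UnicodeDecodeError` with this approach, so the leniency is limited to the one sequence discussed here. Whether even that is acceptable depends on whether later code treats NUL specially.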
Handling this can be tricky. You might want to examine the C routines developed by Unicode to handle conversions, available at ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c. It’s unclear to me if these routines are open source software (the licenses don’t clearly say whether or not they can be modified), so beware of that.
This section has discussed UTF-8, because it’s the most popular multibyte encoding of UCS, simplifying a lot of international text handling issues. However, it’s certainly not the only encoding; there are other encodings, such as UTF-16 and UTF-7, which have the same kinds of issues and must be validated for the same reasons.
Another issue is that some phrases can be expressed in more than one way in ISO 10646/Unicode. For example, some accented characters can be represented as a single character (with the accent) and also as a set of characters (e.g., the base character plus a separate combining accent). These two forms may appear identical. There’s also a zero-width space that could be inserted, with the result that apparently-similar items are considered different. Beware of situations where such hidden text could interfere with the program. This is an issue that in general is hard to solve; most programs don’t have such tight control over the clients that they know completely how a particular sequence will be displayed (since this depends on the client’s font, display characteristics, locale, and so on). One approach is to require clients to send data in a normalized form, and if you don’t trust the clients, force their data into that form. The W3C recommends Normalization Form C in their draft document Character Model for the World Wide Web. Normalization Form C is a good approach, because it’s what nearly all programs do anyway, and it’s slightly more efficient in space. See the W3C document for more information.
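As a minimal illustration of forcing data into Normalization Form C, the following Python sketch uses the standard library's `unicodedata.normalize`. The helper names `to_nfc` and `strip_zero_width_space` are hypothetical, and removing the zero-width space is shown only as one possible policy, not something normalization does for you.

```python
import unicodedata

composed = "\u00e9"       # e with acute accent as a single code point
decomposed = "e\u0301"    # 'e' followed by a combining acute accent


def to_nfc(text: str) -> str:
    """Force text into Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def strip_zero_width_space(text: str) -> str:
    """One possible policy (an assumption of this sketch): drop U+200B."""
    return text.replace("\u200b", "")


same_before = composed == decomposed                  # differ as raw strings
same_after = to_nfc(composed) == to_nfc(decomposed)   # agree once normalized
hidden = "ad\u200bmin"    # looks like "admin" in many displays
```

The two spellings of the accented letter compare unequal until both are normalized. The zero-width space survives NFC unchanged, so `to_nfc(hidden)` still differs from "admin"; handling such invisible characters needs a separate, application-specific decision.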