Unicode

FIELDS OF STUDY

Computer Science; Software Engineering; Information Technology

ABSTRACT

Unicode is a character-encoding system used by computer systems worldwide. It contains numeric codes for more than 120,000 characters from 129 scripts. Unicode is designed for backward compatibility with older character-encoding standards, such as the American Standard Code for Information Interchange (ASCII). It is supported by most major web browsers, operating systems, and other software.

CHARACTER-ENCODING SYSTEMS

In order for computer systems to process text, the characters and other graphic symbols used in written languages must be converted to numbers that the computer can read. The process of converting these characters and symbols to numbers is called “character encoding.” As the use of computer systems increased during the 1940s and 1950s, many different character encodings were developed.

To improve the ability of computer systems to interoperate, a standard encoding system was developed. Released in 1963 and revised in 1967, the American Standard Code for Information Interchange (ASCII) encoded ninety-five English language characters and thirty-three control characters into values ranging from 0 to 127. However, ASCII only provided support for the English language. Thus, there remained a need for a system that could encompass all of the world's languages.
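
Because every ASCII value fits in the range 0 to 127, each character occupies a single byte. The following Python sketch (illustrative only, not part of any standard) shows this encoding and what happens when a character falls outside ASCII's range:

    # Encode a string to ASCII; each character becomes one byte (0-127).
    text = "Hi!"
    print(list(text.encode("ascii")))   # [72, 105, 33]

    # Characters outside ASCII's 128-value repertoire cannot be encoded.
    try:
        "é".encode("ascii")
    except UnicodeEncodeError as err:
        print(err)   # 'ascii' codec can't encode character '\xe9' ...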

Unicode was developed to provide a character encoding system that could encompass all of the scripts used by current and historic written languages. By 2016, Unicode provided character encoding for 129 scripts and more than 120,000 characters. These include special characters, such as control characters, symbols, and emoji.




[Figure: The Unicode Standard is a universally recognized coding system for more than 120,000 characters, using either 8-bit (UTF-8) or 16-bit (UTF-16) encoding. The chart shows each character symbol and its corresponding hexadecimal UTF-8 code. For the first 128 characters (code points 0 to 127), UTF-8 and ASCII are identical.]
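
To make the caption concrete, this Python sketch encodes the same characters in the two encoding forms named above and shows that ASCII-range characters are unchanged in UTF-8:

    # One code point, serialized in two different Unicode encoding forms.
    print("é".encode("utf-8"))       # b'\xc3\xa9'  (two bytes)
    print("é".encode("utf-16-be"))   # b'\x00\xe9'  (two bytes, big-endian)

    # For code points 0-127, UTF-8 output matches ASCII byte for byte.
    print("A".encode("utf-8"))       # b'A'  (the single byte 0x41)
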
UNDERSTANDING THE UNICODE STANDARD

The Unicode standard encodes graphemes, not glyphs. A grapheme is the smallest unit used by a writing system, such as an alphabetic letter or a Chinese character. A glyph is a specific representation of a grapheme, such as the letter A rendered in a particular typeface and font size. The Unicode standard provides a code point, or number, to represent each grapheme. However, Unicode leaves the rendering of the glyph that matches the grapheme to software programs. For example, the Unicode value U+0041 (which represents the grapheme for the letter A) might be provided to a web browser. The browser might then render the glyph of the letter A using the Times New Roman font.
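
This separation can be observed programmatically. In Python, for example, the standard unicodedata module identifies a grapheme by its code point, while the choice of typeface is left entirely to the display layer:

    import unicodedata

    # The code point identifies the grapheme, independent of any font.
    ch = "\u0041"                   # the code point U+0041
    print(ch)                       # A (the glyph is chosen by the renderer)
    print(unicodedata.name(ch))     # LATIN CAPITAL LETTER A
    print(f"U+{ord(ch):04X}")       # U+0041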

Unicode defines 1,114,112 code points. Each code point is assigned a hexadecimal number ranging from 0 to 10FFFF. When written, these values are typically preceded by U+. For example, the letter J is assigned the hexadecimal number 004A and is written U+004A. The Unicode Consortium provides charts listing all defined graphemes and their associated code points. In order to allow organizations to define their own private characters without conflicting with assigned Unicode characters, ranges of code points are left undefined. One of these ranges includes all of the code points between U+E000 and U+F8FF. Organizations may assign undefined code points to their own private graphemes.
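
A short Python sketch showing how the U+ notation corresponds to ordinary integers, along with a hypothetical helper (not part of any standard library) that tests whether a code point falls in the private use range described above:

    # Convert between characters and U+ code point notation.
    print(f"U+{ord('J'):04X}")   # U+004A
    print(chr(0x004A))           # J

    # Illustrative helper: test membership in the U+E000-U+F8FF range.
    def in_private_use_area(code_point: int) -> bool:
        return 0xE000 <= code_point <= 0xF8FF

    print(in_private_use_area(0xE000))   # True
    print(in_private_use_area(0x004A))   # False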

One inherent problem with Unicode is that certain graphemes have been assigned to multiple code points. In an ideal system, each grapheme would be assigned a single code point to simplify text processing. However, to encourage adoption of the Unicode standard, older character encodings were incorporated into Unicode so that existing data could be converted without loss. As a result, certain graphemes are assigned more than one code point in the Unicode standard.
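
For example, the letter Å is encoded both as U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE) and as U+212B (ANGSTROM SIGN), a duplicate retained for compatibility with older standards. A brief Python sketch showing that the two code points are distinct even though they render identically:

    import unicodedata

    a_ring = "\u00C5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
    angstrom = "\u212B"   # ANGSTROM SIGN (compatibility duplicate)

    print(a_ring, angstrom)             # Å Å  (identical glyphs)
    print(a_ring == angstrom)           # False: different code points
    print(unicodedata.name(angstrom))   # ANGSTROM SIGN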

Unicode also provides support for normalization. Normalization ensures that different code point sequences that represent equivalent characters are recognized as equal when processing text. For example, normalization ensures that the single character é (U+00E9) and the combination of the character e (U+0065) followed by the combining acute accent (U+0301) are treated as equivalent.
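
In Python, this behavior is exposed through the unicodedata.normalize function; a minimal sketch using the NFC (canonical composition) form:

    import unicodedata

    composed = "\u00E9"      # é as a single code point
    decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

    print(composed == decomposed)   # False before normalization
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True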

SAMPLE PROBLEM

Using a hexadecimal character chart as a reference, translate the following characters into their Unicode code point values:

<, 9, ?, E, and @.

Then select an undefined code point to store a private grapheme.

Answer:

Unicode uses the hexadecimal character code preceded by a U+ to indicate that the hexadecimal value refers to a Unicode character. Using the chart, <, 9, ?, E, and @ are associated with the following hexadecimal values: 003C, 0039, 003F, 0045, and 0040. Their Unicode code point values are therefore: U+003C, U+0039, U+003F, U+0045, and U+0040.

A private grapheme may be assigned any code point within the ranges U+E000 to U+F8FF, U+F0000 to U+FFFFD, and U+100000 to U+10FFFD. These code points are left undefined by the Unicode standard.
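
The answer can also be checked mechanically. This Python sketch prints the code point of each character in the problem and confirms that a code point chosen from the private use range carries the "Co" (private use) general category:

    import unicodedata

    for ch in "<9?E@":
        print(ch, f"U+{ord(ch):04X}")
    # < U+003C, 9 U+0039, ? U+003F, E U+0045, @ U+0040

    # A code point from the U+E000-U+F8FF private use range.
    print(unicodedata.category("\uE000"))   # Co (Other, Private Use)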

USING UNICODE TO CONNECT SYSTEMS WORLDWIDE

Since its introduction in 1991, Unicode has been widely adopted. It is supported by major operating system and software companies, including Microsoft and Apple, and is implemented on UNIX systems as well. Unicode has become an important encoding system for use on the Internet and is widely supported by web browsers and other Internet-related technologies. While older encodings such as ASCII are still used, Unicode's support for the world's languages makes it the most important character-encoding system in use. New scripts, pictographs, and symbols are added regularly, so Unicode remains poised for significant growth in the decades to come.

—Maura Valentino, MSLIS
