February 1995/Code Capsules/Sidebar

Character Sets

A script is a set of symbols used to convey textual information. There are over 30 major scripts in the world. Some scripts, such as Roman and Cyrillic, serve many languages. World scripts can be categorized according to the hierarchy in Figure 1.
Most scripts are alphabetic. The Han script used by Chinese, Japanese, and Korean, however, is an ideographic (or more accurately, logographic) script. Each Han character represents an object or concept — these languages have no notion of words composed of letters from an alphabet.
A character set is a collection of text symbols with an associated numerical encoding. The ASCII character set with which most of us are familiar maps the letters and numerals used in our culture to integers in the range [32, 126], with special control codes filling out the 7-bit range [0, 127]. As the 'A' in the acronym suggests, this is strictly an American standard. Moreover, this standard only specifies half of the 256 code points available in a single 8-bit byte.
There are a number of extended ASCII character sets that fill the upper range [128, 255] with graphics characters, accented letters, or non-Roman characters. Since 256 code points are not enough to cover even the Roman alphabets in use today, there are five separate, single-byte standards for applications that use Roman characters (see Figure 2).
The obvious disadvantage of single-byte character sets is the difficulty of processing data from distinct regions, such as Greek and Hebrew, in a single application. Single-byte encoding is wholly unfit for Chinese, Japanese, and Korean, since there are thousands of Han characters.
One way to increase the number of characters in a single encoding is to map characters to more than one byte. A multibyte character set maps a character to a variable-length sequence of one or more byte values. In one popular encoding scheme, if the most significant bit of a byte is zero, the character it represents is standard ASCII; if not, that byte and the next form a 16-bit code for a local character.
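The two-byte scheme just described can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function name mb_count is hypothetical, and the routine assumes the simple rule given above (high bit clear means one ASCII byte, high bit set means the lead byte of a two-byte character) rather than any particular national standard.

```c
#include <stddef.h>

/* Count the characters (not bytes) in a NUL-terminated string encoded
   with the simple scheme above: a byte with the high bit clear is one
   ASCII character; a byte with the high bit set starts a two-byte
   character. The name mb_count is hypothetical. */
size_t mb_count(const unsigned char *s)
{
    size_t n = 0;

    while (*s != '\0') {
        if (*s & 0x80) {
            if (s[1] == '\0')
                break;      /* truncated pair: do not count it */
            s += 2;         /* lead byte plus trail byte */
        } else {
            s += 1;         /* plain ASCII */
        }
        ++n;
    }
    return n;
}
```

For the four bytes 'A', 0x88, 0xA0, 'B' (the middle pair is an arbitrary illustrative value), mb_count reports three characters, which shows why byte counts and character counts diverge in a multibyte string.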
Multibyte encodings are storage-efficient since they have no unused bytes, but they require special algorithms to compute indices into a string, or to find string length, since characters are represented as a variable number of bytes. To overcome string indexing problems, Standard C defines functions that process multibyte characters, and that convert multibyte strings into wide-character strings (i.e., strings of wchar_t, usually two-byte characters). Unfortunately, these multibyte and wide-character functions are commonly available only on XPG4-compliant UNIX platforms and Japanese platforms. The recently approved Amendment 1 to the C Standard defines many additional functions for processing sequences of multibyte and wide characters, and should entice U.S. vendors to step out of their cultural comfort zone.
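As a rough sketch of the Standard C facilities mentioned above, the following uses mblen and mbstowcs from <stdlib.h>. The wrapper names mb_length and to_wide are hypothetical, and the results depend on the locale the program has selected (typically with setlocale(LC_CTYPE, "")); in the default "C" locale every character is a single byte.

```c
#include <stddef.h>
#include <stdlib.h>

/* Number of characters in a multibyte string, measured with mblen
   in the current locale. Returns (size_t)-1 on an invalid sequence.
   The name mb_length is hypothetical. */
size_t mb_length(const char *s)
{
    size_t n = 0;
    int len;

    mblen(NULL, 0);                 /* reset any shift state */
    while (*s != '\0') {
        len = mblen(s, MB_CUR_MAX);
        if (len < 0)
            return (size_t)-1;
        s += len;
        ++n;
    }
    return n;
}

/* Convert a multibyte string to a NUL-terminated wide string.
   Returns the number of wide characters stored (excluding the
   terminator) or (size_t)-1 on error. The name to_wide is hypothetical. */
size_t to_wide(wchar_t *dst, size_t dst_len, const char *src)
{
    size_t n;

    if (dst_len == 0)
        return (size_t)-1;
    n = mbstowcs(dst, src, dst_len - 1);
    if (n != (size_t)-1)
        dst[n] = L'\0';             /* mbstowcs may fill the buffer without terminating */
    return n;
}
```

Once the text is in a wchar_t array, ordinary subscripting finds the nth character again, at the cost of the extra storage.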

Code Pages
Since standard ASCII consists of only 128 code points, there are 128 more waiting to be used in an eight-bit character encoding. It has been common practice to populate the upper 128 codes with characters suitable for local use. The combination of values 128-255 together with ASCII is called a code page under MS-DOS. The default code page for the IBM PC in the United States and much of Europe (#437) includes some box-drawing and other graphics characters, and Roman characters with diacritical marks. Other MS-DOS code pages include:
863 Canadian-French
850 Multi-Lingual (Latin-1)
865 Nordic
860 Portuguese
852 Slavic (Latin-2)
Non-U.S. versions of MS-DOS define other code pages. You can switch between code pages in MS-DOS applications, but not in U.S. Microsoft Windows (except in a DOS window). Only one code page remains active for Windows-hosted applications throughout an entire Windows session. Different versions of Windows have code pages appropriate for their region. For example, Windows-J for Japan uses a code page based on Shift-JIS. Windows 95 (a.k.a. Chicago) will support full code-page switching.
Since code pages use code points in the range [128, 255], it is important to avoid depending on or modifying the high-bit value in any byte of your program's data. A program that follows this discipline is called 8-bit clean.
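A small sketch may make the 8-bit-clean discipline concrete. The function names are hypothetical; the first routine leaves bytes in [128, 255] untouched, and the second shows the kind of seven-bit masking to avoid.

```c
#include <ctype.h>

/* Uppercase the ASCII letters of a string in place, leaving every
   byte in [128, 255] exactly as it was. The name upcase_clean is
   hypothetical. */
void upcase_clean(char *s)
{
    for (; *s != '\0'; ++s) {
        /* Go through unsigned char: plain char may be signed, and a
           negative value must not be passed to toupper. */
        unsigned char c = (unsigned char)*s;
        if (c < 0x80)
            *s = (char)toupper(c);
    }
}

/* What NOT to do: masking to seven bits silently turns a code-page
   character into an unrelated ASCII code. */
char strip_high_bit(char c)
{
    return (char)(c & 0x7F);
}
```

Under code page 437, for instance, byte 0x82 is an accented 'e'; strip_high_bit turns it into control code 0x02, whereas upcase_clean passes it through unchanged.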

Character Set Standards
Seven-bit ASCII is the world's most widely used character set. ISO 646 is essentially ASCII with a few codes subject to localization. For example, the currency symbol, code point 0x24, is '$' only in the United States, and is allowed to "float" to adhere to local conventions. ISO 646 is sometimes called the portable character set (PCS) and is the standard alphabet for programming languages.
ISO 8859 is a standard that takes advantage of all 256 single-byte code points to define nine eight-bit mappings for nine selected alphabets (see Figure 2). Each of these mappings retains ISO 646 as a subset, so they differ mainly in the upper 128 code points. Some of these mappings are the basis for MS-DOS code pages.
There is no official ISO standard for multibyte character sets in the Far East. However, each region of the Far East has its own local (national) standards. PC-industry standards, based on national standards, are also in common use in the Far East. Examples include Eten, Big Five, and Shift JIS.

ISO 10646
To simplify the development of internationalized applications, ISO developed the Universal Multiple-Octet Coded Character Set (ISO 10646), to accommodate all characters from all significant modern languages in a single encoding. An octet is a contiguous, ordered collection of eight bits, which is a byte on most systems. ISO 10646 allows for 2,147,483,648 (2^31) characters, although only 34,168 have been defined. It is organized into 128 groups, each group containing 256 planes of 65,536 characters each (256 rows x 256 columns).
Any one of the 2^31 characters can be addressed by four octets, representing respectively the group, plane, row, and column of its location in the four-dimensional space. Consequently, ISO 10646 is a 32-bit character encoding. ASCII code points are a subset of ISO 10646: you just add leading zeroes to fill out 32 bits. For example, the encoding for the letter 'a' is 00000061 hexadecimal (i.e., Group 0, Plane 0, Row 0, Column 0x61).
Plane 0 of Group 0 is the only one of the 32,768 planes that has been populated to date. It is called the Basic Multi-Lingual Plane (BMP). ISO 10646 allows conforming implementations to be BMP-based, i.e., requiring only two octets, representing the row and column within the BMP. The full four-octet form of encoding is called UCS-4, and the two-octet form UCS-2. Under UCS-2, therefore, the hexadecimal encoding for the letter 'a' is 0061 (Row 0, Column 0x61). Row 0 of the BMP is essentially ISO 8859-1 (Latin-1) with the U.S. dollar sign as the currency symbol.
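The group, plane, row, and column addressing can be sketched with a few shifts and masks. The type and function names below are hypothetical, and unsigned long is assumed only because Standard C guarantees it at least 32 bits.

```c
/* The four octets of a UCS-4 code. All names here are hypothetical. */
typedef struct {
    unsigned char group;    /* 0..127 */
    unsigned char plane;    /* 0..255 */
    unsigned char row;      /* 0..255 */
    unsigned char column;   /* 0..255 */
} ucs4_parts;

/* Split a UCS-4 code into its four octets. */
ucs4_parts ucs4_split(unsigned long code)
{
    ucs4_parts p;

    p.group  = (unsigned char)((code >> 24) & 0x7F);  /* only 128 groups */
    p.plane  = (unsigned char)((code >> 16) & 0xFF);
    p.row    = (unsigned char)((code >> 8) & 0xFF);
    p.column = (unsigned char)(code & 0xFF);
    return p;
}

/* Reassemble the four octets into a UCS-4 code. */
unsigned long ucs4_join(ucs4_parts p)
{
    return ((unsigned long)p.group << 24) | ((unsigned long)p.plane << 16)
         | ((unsigned long)p.row << 8) | (unsigned long)p.column;
}

/* A code lies in the BMP when both group and plane are zero; only
   then can it be written in the two-octet UCS-2 form. */
int ucs4_in_bmp(unsigned long code)
{
    return (code >> 16) == 0;
}
```

Splitting 00000061 yields Group 0, Plane 0, Row 0, Column 0x61, and dropping the two leading zero octets of any BMP code gives its UCS-2 form.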
ISO 10646 also defines combining characters, such as non-spacing diacritics. In conforming applications, combining characters always follow the base character that they modify. The UCS-2 encoding for á, then, consists of two 16-bit integers: 0061 0301 (0301 is called the non-spacing acute). For compatibility with existing character sets, there is also a single UCS-2 code point for á (00e1).
For the most part, only Roman characters have such dual representations. Some non-Roman languages, such as Arabic, Hindi, and Thai, also require the use of combining characters. ISO 10646 specifies three levels of conformance for tools and applications:
Level 1 combining characters not allowed
Level 2 combining characters allowed for Arabic, Hebrew, and Indic scripts only
Level 3 combining characters allowed with no restrictions
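To illustrate the dual representation of accented Roman letters described above, here is a minimal sketch that folds a base letter followed by the non-spacing acute (0301) into the single-code-point form. The names are hypothetical, and the table covers only the five lowercase vowels as found in Row 0 of the BMP; a real tool would need a far larger table.

```c
#include <stddef.h>

typedef unsigned short ucs2_t;      /* assumed: holds at least 16 bits */

#define NONSPACING_ACUTE 0x0301

/* Illustrative table only: lowercase vowels and their acute forms. */
static const struct { ucs2_t base, composed; } acute_table[] = {
    { 0x0061, 0x00E1 },   /* a */
    { 0x0065, 0x00E9 },   /* e */
    { 0x0069, 0x00ED },   /* i */
    { 0x006F, 0x00F3 },   /* o */
    { 0x0075, 0x00FA }    /* u */
};

/* Replace each <base, 0301> pair found in acute_table with its
   single-code-point equivalent, in place. Returns the new length.
   Pairs not in the table are left as they are. */
size_t compose_acute(ucs2_t *s, size_t len)
{
    size_t in = 0, out = 0, k;

    while (in < len) {
        ucs2_t c = s[in++];
        if (in < len && s[in] == NONSPACING_ACUTE) {
            for (k = 0; k < sizeof acute_table / sizeof acute_table[0]; ++k) {
                if (acute_table[k].base == c) {
                    c = acute_table[k].composed;
                    ++in;           /* consume the combining character */
                    break;
                }
            }
        }
        s[out++] = c;
    }
    return out;
}
```

Applied to the sequence 0061 0301 0062, the routine leaves 00E1 0062, the form a tool restricted to Level 1 would expect.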

Unicode
Unicode is a 16-bit encoding scheme that supports most modern written languages. It began independently of ISO 10646, but with Unicode version 1.1, it is now a subset of 10646 (to be precise, it is UCS-2, Level 3). Unicode also defines mapping tables to translate Unicode characters to and from most national and international character set standards.
Some applications should readily convert to Unicode. Since ASCII is a subset, it is only necessary to change narrow (eight-bit) characters to wide characters. In C and C++, this means replacing char declarations with wchar_t. Some other character sets, such as Thai and Hangul, appear in the same relative order within Unicode, so you just need to add or subtract a fixed offset. Converting Han characters requires a lookup table.
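The two easy conversions can be sketched as follows. The function names are hypothetical. For the Thai case the sketch assumes the single-byte TIS-620 layout (not named in the text above), whose codes 0xA1 through 0xFB land on Unicode 0E01 through 0E5B when 0x0D60 is added; it ignores the few unassigned positions inside that range, and the use of FFFD for unmapped bytes is an illustrative choice.

```c
#include <stddef.h>

/* Widen a seven-bit ASCII string into wchar_t, one code per
   character. Returns the number of characters stored, excluding the
   terminator. The name ascii_to_wide is hypothetical. */
size_t ascii_to_wide(wchar_t *dst, size_t dst_len, const char *src)
{
    size_t i = 0;

    if (dst_len == 0)
        return 0;
    while (src[i] != '\0' && i + 1 < dst_len) {
        dst[i] = (wchar_t)(unsigned char)src[i];
        ++i;
    }
    dst[i] = L'\0';
    return i;
}

/* Assumption: the source byte is TIS-620 Thai. */
#define THAI_OFFSET 0x0D60

/* Map one TIS-620 byte to a 16-bit Unicode value by adding a fixed
   offset. Bytes outside ASCII and the Thai block map to FFFD. */
unsigned short thai_to_unicode(unsigned char c)
{
    if (c < 0x80)
        return c;                                   /* ASCII is unchanged */
    if (c >= 0xA1 && c <= 0xFB)
        return (unsigned short)(c + THAI_OFFSET);   /* same relative order */
    return 0xFFFD;
}
```

No such shortcut exists for Han characters, which is why they need the lookup table mentioned above.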
Vendors are now beginning to support Unicode, and tools are available at both the operating system and API levels. Tools supporting the 32-bit encodings of ISO 10646 are not expected for many years — especially since no planes beyond the BMP have been populated.

Bibliography
"UCS Coexistence/Migration," X/Open Internal Report, Doc. No. SC22/WG20 N252, 1993.
The Unicode Standard. The Unicode Consortium, Addison-Wesley, 1991.
Katzner, Kenneth. The Languages of the World. 1986.
Martin, Sandra. "Internationalization Explored," UniForum, 1992.
Plauger, P. J. "Large Character Sets for C," Dr. Dobb's Journal, August 1992.