In a growing world market, the globalization of applications has become a necessity. That is, as world markets unite in an electronic marketplace and businesses compete worldwide, a single representative character encoding environment is needed for global character data string processing. For example, for a company to operate in multiple languages, such as English and Chinese, it needs a coding system that can work with almost all of the world's language character sets.
Presently, many types of character encoding environments are used in character data string processing. For example, the most widely used encoding environment in the United States of America is the American Standard Code for Information Interchange (ASCII), while in Europe the character encoding environment Western Europe 8 (WE8DEC) is utilized.
Both character encoding environments (i.e., ASCII and WE8DEC) utilize a single byte (8 bits) per character. Therefore, at most 256 different characters may be represented by either character encoding environment (standard ASCII itself defines only 128 of them). In the English language, and most European languages, 256 character representations are more than enough to cover most, if not all, of the possible characters of the language. However, many Asian languages, for example Japanese and Chinese, have many more than 256 characters. Thus, a single-byte character encoding environment is not large enough to represent these languages. In fact, due to the number of Chinese characters, an environment of up to four bytes per character (32 bits) may be required to adequately depict the language.
To solve the problem of providing a worldwide applicable character set, a globalization character encoding system (Unicode) has been developed by the Unicode Consortium. In general, Unicode provides a unique number for every character, regardless of platform, program, or language. The Unicode standard has been adopted by many industry leaders. There are two types of Unicode encoding character sets, used in different situations. One is a fixed-width encoding character set such as UTF16, UTF32, and the like. The other is a variable-width encoding character set such as UTF8, and the like.
The fixed-width character sets such as UTF16, UTF32, and the like, require a fixed number of bits to represent each character. For example, UTF16 requires 2 bytes (16 bits) and UTF32 requires 4 bytes (32 bits). These character sets are suitable for Asian languages. One advantage of a fixed-width character set is that string operations can be very efficient. For example, in a UTF16 character set, a data string of 66 bytes is immediately recognized as having 33 characters.
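The 66-byte example above can be sketched in a few lines of Python. This is only an illustration: the function name utf16_char_count is hypothetical, and the sketch assumes every character occupies exactly 2 bytes, which holds for characters in the Basic Multilingual Plane but not for characters that UTF16 encodes as surrogate pairs.

```python
def utf16_char_count(data: bytes) -> int:
    """Count characters in UTF16 data, assuming exactly 2 bytes per character.

    Assumption: no surrogate pairs are present, so the count is a plain
    division and the bytes themselves never need to be inspected.
    """
    if len(data) % 2 != 0:
        raise ValueError("UTF16 data must contain an even number of bytes")
    return len(data) // 2


# 33 characters encoded as big-endian UTF16 without a byte-order mark.
encoded = ("x" * 33).encode("utf-16-be")
print(len(encoded), utf16_char_count(encoded))  # 66 33
```

Under that assumption, the character count is obtained in constant time from the byte length alone.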
By comparison, characters in the variable-width character set UTF8 may be represented by one, two, or three bytes. One significant advantage of UTF8 is that ASCII is a subset of the UTF8 encoding. Therefore, any data used in an ASCII environment can be used directly in UTF8 without any migration effort. Another advantage is that UTF8 is well suited to a mixed-language environment in which the majority of the data is ASCII. In such an environment, the ASCII data is represented with one byte per character.
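The variable width and the ASCII compatibility described above can be illustrated with a short Python sketch. The helper names utf8_width and utf8_char_count are hypothetical and chosen for illustration only, and the sample characters are limited to those that need one, two, or three bytes.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes UTF8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


def utf8_char_count(data: bytes) -> int:
    """Count characters in UTF8 data by scanning every byte.

    Continuation bytes have the bit pattern 10xxxxxx, so each byte that
    does not match that pattern starts a new character.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# One-, two-, and three-byte characters: 'A', e with acute accent, a CJK ideograph.
for ch in ("A", "\u00e9", "\u4e2d"):
    print(repr(ch), utf8_width(ch))

# ASCII data is already valid UTF8: both encodings give identical bytes.
ascii_bytes = "Hello".encode("ascii")
utf8_bytes = "Hello".encode("utf-8")
print(ascii_bytes == utf8_bytes)  # True
```

Note that, unlike the fixed-width case, counting characters here requires a pass over the data, because the byte length alone does not determine the character count.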
In contrast, the fixed-width encoding character sets such as UTF16 and UTF32 require that each ASCII character be stored in multiple bytes, which may cause a deleterious drain on system resources. Therefore, the storage requirements for ASCII data are much smaller when the data is stored in UTF8. Due to the above stated advantages, UTF8 has been widely adopted in mixed-language environments.
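A rough way to see the storage difference is to encode the same ASCII-only string in each character set and compare the byte counts. The following Python sketch does this; the function name encoded_sizes and the sample string are assumptions made for illustration, and the big-endian codecs are used so that no byte-order mark is added to the totals.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes needed to store text in each character set."""
    codecs = (("UTF8", "utf-8"), ("UTF16", "utf-16-be"), ("UTF32", "utf-32-be"))
    return {name: len(text.encode(codec)) for name, codec in codecs}


# A 16-character, ASCII-only sample string.
sizes = encoded_sizes("order 42 shipped")
print(sizes)  # {'UTF8': 16, 'UTF16': 32, 'UTF32': 64}
```

For this sample, UTF16 doubles and UTF32 quadruples the storage that UTF8 needs.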