1. Field of the Invention
The present invention relates in general to coded character sets for representing characters in a computer program, and more particularly to the creation of Unicode characters by conversion from non-Unicode characters.
2. Description of the Related Art
Unicode is a new internationally standardized data encoding for character data which allows computers to exchange and process character data in any natural language text. Its most common usage is in representing each character as a sixteen-bit number. This is sometimes called a “double-byte” data representation as a byte contains eight bits.
Most existing computer hardware and software represent specific sets of characters in an eight-bit code, of which ASCII (American National Standard Code for Information Interchange) and EBCDIC (Extended Binary Coded Decimal Interchange Code) are typical examples. In such an eight-bit representation (also known as a single-byte representation), only two-hundred-fifty-six (256) unique numeric values are available, which limits the number of distinct characters that may be encoded. Thus, it is necessary to define a different set of encodings for each desired set of characters.
The chosen set of characters is called a “Character Set”. Each member of the character set can be assigned a unique eight-bit numeric value (“Code Point”) from the set of the two-hundred-fifty-six distinct values (Code Points). A group of assignments of characters and control function meanings to all available code points is called a “Code Page”; for example, the assignments of characters and meanings to the two-hundred-fifty-six code points (0 through 255) of an 8-bit code set is a Code Page. The combination of a specific set of characters and a specific set of numeric value assignments is called a “Coded Character Set”. To distinguish among the many different assignments of characters to codings, each Coded Character Set is assigned an individual identification number called a “Coded Character Set ID” (CCSID).
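The dependence of a code point's meaning on the code page can be illustrated with a minimal sketch. It assumes Python's built-in `cp037` (an EBCDIC code page) and `latin-1` codecs as two example code pages; the helper name `decode_byte` is hypothetical and is not part of the described invention.

```python
def decode_byte(value: int, code_page: str) -> str:
    """Interpret one eight-bit code point under the given code page."""
    return bytes([value]).decode(code_page)


# The same code point, X'C1', denotes different characters
# depending on which code page is in effect.
for code_page in ("cp037", "latin-1"):
    ch = decode_byte(0xC1, code_page)
    print(f"{code_page}: X'C1' -> U+{ord(ch):04X}")
```

Under these assumptions, X'C1' is the letter A (U+0041) in the EBCDIC code page and the accented letter U+00C1 in Latin-1, which is why an identifier such as a CCSID must accompany eight-bit data.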
In situations involving ideographic scripts such as Chinese, Japanese, or Korean, a hybrid or mixed representation of characters is sometimes used. Because the number of ideographic characters greatly exceeds the two-hundred-fifty-six possible representations available through the use of an eight-bit encoding, a special sixteen-bit encoding may be used instead. To manage such sixteen-bit representations in computing systems and devices built for eight-bit representations, two special eight-bit character codes are reserved and used in the eight-bit-character byte stream to indicate a change of alphabet representation. Typically, a string of characters will contain eight-bit characters in a single-byte representation. When the first of the two special character codes (commonly called a “Shift-Out” character) is encountered indicating a switch of alphabets, the bytes subsequent to the Shift-Out character are interpreted as double-byte pairs encoded in the special sixteen-bit double-byte encoding. At the end of the double-byte ideographic string, the other special eight-bit character code (commonly called a “Shift-In” character) is inserted to indicate that the following eight-bit bytes are to be interpreted as single-byte characters, as were those characters preceding the “Shift-Out” character. This hybrid representation is sometimes also called a “double-byte character set” (DBCS) representation. When such DBCS strings are mixed with single-byte character set (SBCS) characters, the representation is sometimes called a “mixed SBCS/DBCS” representation.
Ideographic characters may also be represented as sixteen-bit characters in strings without any SBCS characters other than the special initial “Shift-Out” and final “Shift-In” character codes if they are used in a context where it is known that there are no mixtures of eight-bit characters and sixteen-bit characters. Such usage is sometimes called “pure DBCS”. The Shift-Out and Shift-In codes are still required as the text of the remainder of the program may use single-byte encodings.
To illustrate, assume that the “Shift-Out” character is represented by the character ‘<’ and that the “Shift-In” character is represented by the character ‘>’. Then each of the three representations just described may be written as strings of these forms:
‘abcDEF’ SBCS string
‘AB<wxyz>CD’ mixed SBCS/DBCS string
‘<wxyz>’ pure DBCS string
The actual computer storage representation of each of these three character formats would generally be similar to the following representations. For example, the SBCS string would generally appear in storage as follows:
The hexadecimal encoding of this string in a standard representation may appear as:
After translation to Unicode, the same characters may be represented by the following bytes (shown in hexadecimal encoding):
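As a hedged illustration of the SBCS case, the following sketch converts the string ‘abcDEF’ from a single-byte encoding to sixteen-bit Unicode. It assumes EBCDIC code page 037 (Python's `cp037` codec) as the single-byte representation and big-endian UTF-16 as the Unicode form; neither choice is specified in the text above, and the function name `sbcs_to_utf16` is hypothetical.

```python
def sbcs_to_utf16(data: bytes, code_page: str = "cp037") -> bytes:
    """Decode single-byte data under a code page, re-encode as UTF-16 (big-endian)."""
    return data.decode(code_page).encode("utf-16-be")


sbcs = "abcDEF".encode("cp037")        # assumed single-byte (EBCDIC) form
unicode_bytes = sbcs_to_utf16(sbcs)

print(sbcs.hex(" ").upper())              # 81 82 83 C4 C5 C6
print(unicode_bytes.hex(" ", 2).upper())  # 0061 0062 0063 0044 0045 0046
```

Each eight-bit character becomes one sixteen-bit value, so the six-byte input yields twelve bytes of output.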

Similarly, the computer storage representation of a mixed SBCS/DBCS string may generally appear as follows where ‘wxyz’ represents the four bytes needed to encode the two ideographic DBCS characters between the Shift-Out and Shift-In characters, and the ‘?’ strings indicate the specific encodings assigned to the representations of the DBCS characters:
The hexadecimal encoding of this string in a standard representation may appear as follows (wherein the Shift-Out and Shift-In characters have encodings X‘0E’ and X‘0F’ respectively):
When translated to Unicode, the same characters may be represented by these bytes (shown in hexadecimal encoding):
Note that the Shift-Out and Shift-In characters have been removed, as they are not necessary in the Unicode representation.
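The handling of a mixed SBCS/DBCS string can be sketched as a small state machine that switches interpretation at the Shift-Out (X'0E') and Shift-In (X'0F') codes. This is a minimal sketch under stated assumptions: the single-byte portion is assumed to be EBCDIC code page 037, and `HYPOTHETICAL_DBCS` is an invented two-entry table standing in for a real DBCS code page, whose actual assignments are not given here. The function name `mixed_to_unicode` is likewise hypothetical.

```python
SHIFT_OUT = 0x0E  # switches interpretation to double-byte pairs
SHIFT_IN = 0x0F   # switches interpretation back to single bytes

# Invented assignments for illustration only; not a real DBCS code page.
HYPOTHETICAL_DBCS = {b"\x41\x42": "\u65e5", b"\x41\x43": "\u672c"}


def mixed_to_unicode(data: bytes, sbcs_codec: str, dbcs_table: dict) -> str:
    """Convert a mixed SBCS/DBCS byte string to a Unicode string."""
    out = []
    i = 0
    in_dbcs = False
    while i < len(data):
        byte = data[i]
        if not in_dbcs:
            if byte == SHIFT_OUT:
                in_dbcs = True          # drop the Shift-Out code itself
            else:
                out.append(bytes([byte]).decode(sbcs_codec))
            i += 1
        elif byte == SHIFT_IN:
            in_dbcs = False             # drop the Shift-In code itself
            i += 1
        else:
            pair = data[i:i + 2]
            if len(pair) < 2:
                raise ValueError("truncated double-byte character")
            out.append(dbcs_table[pair])
            i += 2
    if in_dbcs:
        raise ValueError("Shift-Out without matching Shift-In")
    return "".join(out)


mixed = ("AB".encode("cp037") + b"\x0e" + b"\x41\x42\x41\x43" + b"\x0f"
         + "CD".encode("cp037"))
text = mixed_to_unicode(mixed, "cp037", HYPOTHETICAL_DBCS)
print(text.encode("utf-16-be").hex(" ", 2).upper())
```

The ten input bytes (four single-byte characters, two shift codes, and two double-byte pairs) become six sixteen-bit Unicode values, and neither shift code appears in the output.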
For the third type of character string containing pure DBCS characters, the computer storage representation may appear as follows:
The hexadecimal encoding of this string in a standard representation may appear as follows (wherein the Shift-Out and Shift-In characters have encodings X‘0E’ and X‘0F’ respectively):
When translated to Unicode, the same characters would be represented by these bytes (shown in their hexadecimal encoding):
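The pure DBCS case is simpler, because the whole string between the enclosing shift codes consists of double-byte pairs. The following sketch assumes the same X'0E' and X'0F' shift codes and again uses an invented table, `HYPOTHETICAL_DBCS`, in place of a real DBCS code page; the function name `pure_dbcs_to_unicode` is hypothetical.

```python
SHIFT_OUT = 0x0E
SHIFT_IN = 0x0F

# Invented assignments for illustration only; not a real DBCS code page.
HYPOTHETICAL_DBCS = {b"\x41\x42": "\u65e5", b"\x41\x43": "\u672c"}


def pure_dbcs_to_unicode(data: bytes, dbcs_table: dict) -> str:
    """Convert a pure DBCS string, enclosed in Shift-Out/Shift-In, to Unicode."""
    if len(data) < 2 or data[0] != SHIFT_OUT or data[-1] != SHIFT_IN:
        raise ValueError("pure DBCS string must be enclosed in Shift-Out/Shift-In")
    body = data[1:-1]                   # strip the two shift codes
    if len(body) % 2 != 0:
        raise ValueError("odd number of bytes between shift codes")
    # Every two bytes form one double-byte character.
    return "".join(dbcs_table[body[i:i + 2]] for i in range(0, len(body), 2))


pure = b"\x0e" + b"\x41\x42\x41\x43" + b"\x0f"
converted = pure_dbcs_to_unicode(pure, HYPOTHETICAL_DBCS)
print(converted.encode("utf-16-be").hex(" ", 2).upper())
```

Here the six input bytes reduce to four bytes of Unicode data, since the shift codes are discarded and each double-byte pair maps to one sixteen-bit value.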

In typical usage, many coded character sets are used to represent the characters of various national languages. As computer applications evolve to support a greater range of national languages, there is a corresponding requirement to encompass a great multiplicity of “alphabets”. For example, a software supplier in England may provide an account management program to a French company with a subsidiary in Belgium whose customers include people with names and addresses in Danish, Dutch, French, Flemish, and German alphabets. If the program creates billings or financial summaries, it must also cope with a variety of currency symbols. Using conventional technology, it may be difficult, or even impossible, to accommodate such a variety of alphabets and characters using a single eight-bit coded character set.
In other applications, a program may be required to present messages to its users in any of several selectable national languages (this is often called “internationalization”). Creating the message texts requires that the program's suppliers be able to create the corresponding messages in each of the supported languages, which requires special techniques for handling a multiplicity of character sets in a single application.
Unicode offers a solution to the character encoding problem, by providing a single sixteen-bit representation of the characters used in most applications. However, most existing computer equipment creates, manages, displays, or prints only eight-bit single-byte data representations. In order to simplify the creation of double-byte Unicode data, there is a need for ways to allow computer users to enter their data in customary single-byte, mixed SBCS/DBCS, and pure DBCS formats, and then have it converted automatically to the double-byte Unicode representation.