A portion of the Disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
1. Field of the Invention
The present invention relates in general to coded character sets for representing characters in a computer program, and more particularly to the creation of Unicode characters by converting from non-Unicode characters.
2. Description of the Related Art
Unicode is a new internationally standardized data encoding for character data which allows computers to exchange and process character data in any natural language text. Its most common usage is in representing each character as a sixteen-bit number. This is sometimes called a "double-byte" data representation as a byte contains eight bits.
Most existing computer hardware and software represent specific sets of characters in an eight-bit code, of which ASCII (American National Standard Code for Information Interchange) and EBCDIC (Extended Binary Coded Decimal Interchange Code) are typical examples. In such an eight-bit representation (also known as a single-byte representation), the limit of two-hundred-fifty-six (256) unique numeric values restricts the set of distinct characters that may be encoded. Thus, it is necessary to define different sets of encodings for each desired set of characters.
The chosen set of characters is called a "Character Set". Each member of the character set can be assigned a unique eight-bit numeric value ("Code Point") from the set of the two-hundred-fifty-six distinct values (Code Points). A group of assignments of characters and control function meanings to all available code points is called a "Code Page"; for example, the assignments of characters and meanings to the two-hundred-fifty-six code points (0 through 255) of an 8-bit code set is a Code Page. The combination of a specific set of characters and a specific set of numeric value assignments is called a "Coded Character Set". To distinguish among the many different assignments of characters to codings, each Coded Character Set is assigned an individual identification number called a "Coded Character Set ID" (CCSID).
In situations involving ideographic scripts such as Chinese, Japanese, or Korean, a hybrid or mixed representation of characters is sometimes used. Because the number of ideographic characters greatly exceeds the two-hundred-fifty-six possible representations available through the use of an eight-bit encoding, a special sixteen-bit encoding may be used instead. To manage such sixteen-bit representations in computing systems and devices built for eight-bit representations, two special eight-bit character codes are reserved and used in the eight-bit-character byte stream to indicate a change of alphabet representation. Typically, a string of characters will contain eight-bit characters in a single-byte representation. When the first of the two special character codes (commonly called a "Shift-Out" character) is encountered indicating a switch of alphabets, the bytes subsequent to the Shift-Out character are interpreted as double-byte pairs encoded in the special sixteen-bit double-byte encoding. At the end of the double-byte ideographic string, the other special eight-bit character code (commonly called a "Shift-In" character) is inserted to indicate that the following eight-bit bytes are to be interpreted as single-byte characters, as were those characters preceding the "Shift-Out" character. This hybrid representation is sometimes also called a "double-byte character set" (DBCS) representation. When such DBCS strings are mixed with single-byte character set (SBCS) characters, the representation is sometimes called a "mixed SBCS/DBCS" representation.
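The Shift-Out/Shift-In switching described above can be sketched as a small scanner. This is a minimal illustration, not the method of the invention: the function name `split_mixed` is hypothetical, and the code assumes the Shift-Out and Shift-In encodings X'0E' and X'0F' that are commonly used on EBCDIC systems.

```python
SHIFT_OUT = 0x0E  # assumed Shift-Out encoding (X'0E')
SHIFT_IN = 0x0F   # assumed Shift-In encoding (X'0F')


def split_mixed(data: bytes):
    """Split a mixed SBCS/DBCS byte string into ('sbcs', bytes) and
    ('dbcs', bytes) segments, dropping the shift codes themselves."""
    segments = []
    mode = "sbcs"
    current = bytearray()
    for byte in data:
        if mode == "sbcs" and byte == SHIFT_OUT:
            # Switch to double-byte interpretation.
            if current:
                segments.append(("sbcs", bytes(current)))
            current = bytearray()
            mode = "dbcs"
        elif mode == "dbcs" and byte == SHIFT_IN and len(current) % 2 == 0:
            # Shift-In is only recognized on a double-byte pair boundary.
            segments.append(("dbcs", bytes(current)))
            current = bytearray()
            mode = "sbcs"
        else:
            current.append(byte)
    if mode == "dbcs":
        raise ValueError("double-byte segment is not closed by Shift-In")
    if current:
        segments.append(("sbcs", bytes(current)))
    return segments
```

Each `'dbcs'` segment holds an even number of bytes, to be read as sixteen-bit pairs; each `'sbcs'` segment is read one byte per character.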
Ideographic characters may also be represented as sixteen-bit characters in strings without any SBCS characters other than the special initial "Shift-Out" and final "Shift-In" character codes if they are used in a context where it is known that there are no mixtures of eight-bit characters and sixteen-bit characters. Such usage is sometimes called "pure DBCS". The Shift-Out and Shift-In codes are still required as the text of the remainder of the program may use single-byte encodings.
To illustrate, assume that the "Shift-Out" character is represented by the character '<' and that the "Shift-In" character is represented by the character '>'. Then each of the three representations just described may be written as strings of these forms:
The actual computer storage representation of each of these three character formats would generally be similar to the following representations. For example, the SBCS string would generally appear in storage as follows: 
The hexadecimal encoding of this string in a standard representation may appear as: 
After translation to Unicode, the same characters may be represented by the following bytes (shown in hexadecimal encoding): 
Similarly, the computer storage representation of a mixed SBCS/DBCS string may generally appear as follows where 'wxyz' represents the four bytes needed to encode the two ideographic DBCS characters between the Shift-Out and Shift-In characters, and the '?' strings indicate the specific encodings assigned to the representations of the DBCS characters: 
The hexadecimal encoding of this string in a standard representation may appear as follows (wherein the Shift-Out and Shift-In characters have encodings X'0E' and X'0F' respectively): 
When translated to Unicode, the same characters may be represented by these bytes (shown in hexadecimal encoding): 
Note that the Shift-Out and Shift-In characters have been removed, as they are not necessary in the Unicode representation.
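The translation just described, in which the shift codes disappear from the Unicode result, might look like the following sketch. Several things here are assumptions made for illustration: the function name `mixed_to_unicode`, the use of Python's built-in `cp037` EBCDIC codec as the single-byte code page, and the two-entry DBCS table, whose byte pairs are invented and do not come from any real code page.

```python
SHIFT_OUT, SHIFT_IN = 0x0E, 0x0F  # X'0E' and X'0F'

# Hypothetical DBCS mapping table: the byte pairs are invented for this
# example and do not reproduce any real double-byte code page.
HYPOTHETICAL_DBCS_TABLE = {
    b"\x45\x41": "\u65e5",
    b"\x45\x42": "\u672c",
}


def mixed_to_unicode(data: bytes, sbcs_codec: str = "cp037",
                     dbcs_table=HYPOTHETICAL_DBCS_TABLE) -> str:
    """Convert a mixed SBCS/DBCS byte string to a Unicode string.
    Shift-Out and Shift-In are consumed and produce no output."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == SHIFT_OUT:
            i += 1  # skip Shift-Out
            while i < len(data) and data[i] != SHIFT_IN:
                pair = data[i:i + 2]
                if len(pair) < 2:
                    raise ValueError("incomplete double-byte pair")
                out.append(dbcs_table[pair])
                i += 2
            if i >= len(data):
                raise ValueError("missing Shift-In")
            i += 1  # skip Shift-In
        else:
            out.append(data[i:i + 1].decode(sbcs_codec))
            i += 1
    return "".join(out)


# Usage: two single-byte letters, two double-byte characters, two more letters.
source = "AB".encode("cp037") + b"\x0e\x45\x41\x45\x42\x0f" + "CD".encode("cp037")
unicode_bytes = mixed_to_unicode(source).encode("utf-16-be")
```

In this example the ten source bytes become twelve bytes of sixteen-bit Unicode: every character occupies two bytes and no shift codes remain.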
For the third type of character string containing pure DBCS characters, the computer storage representation may appear as follows: 
The hexadecimal encoding of this string in a standard representation may appear as follows (wherein the Shift-Out and Shift-In characters have encodings X'0E' and X'0F' respectively): 
When translated to Unicode, the same characters would be represented by these bytes (shown in their hexadecimal encoding): 
In typical usage, many coded character sets are used to represent the characters of various national languages. As computer applications evolve to support a greater range of national languages, there is a corresponding requirement to encompass a great multiplicity of "alphabets". For example, a software supplier in England may provide an account management program to a French company with a subsidiary in Belgium whose customers include people with names and addresses in Danish, Dutch, French, Flemish, and German alphabets. If the program creates billings or financial summaries, it must also cope with a variety of currency symbols. Using conventional technology, it may be difficult, or even impossible, to accommodate such a variety of alphabets and characters using a single eight-bit coded character set.
In other applications, a program may be required to present messages to its users in any of several selectable national languages (this is often called "internationalization"). Creating the message texts requires that the program's suppliers be able to create the corresponding messages in each of the supported languages, which requires special techniques for handling a multiplicity of character sets in a single application.
Unicode offers a solution to the character encoding problem, by providing a single sixteen-bit representation of the characters used in most applications. However, most existing computer equipment creates, manages, displays, or prints only eight-bit single-byte data representations. In order to simplify the creation of double-byte Unicode data, there is a need for ways to allow computer users to enter their data in customary single-byte, mixed SBCS/DBCS, and pure DBCS formats, and then have it converted automatically to the double-byte Unicode representation.
The present invention comprises a data structure stored in a computer system for representing characters in a computer program, and more particularly for creating Unicode characters by converting from non-Unicode characters.
A preferred embodiment of the present invention provides methods for specifying the types of constants whose character values are to be converted to Unicode; for specifying which code page or pages are used for specifying the character encodings used in the source program for writing the character strings to be converted to Unicode; and for performing conversions from SBCS, mixed SBCS/DBCS, and pure DBCS character strings to Unicode. A syntax suitable for specifying character data conversion from SBCS, mixed SBCS/DBCS, and pure DBCS representations to Unicode utilizes an extension to the conventional constant subtype notation. In converting the nominal value data to Unicode, currently relevant SBCS and DBCS code pages are used, as specified at one of three levels or scopes: global options, local AOPTIONS statement specifications, or constant-specific modifiers. Global code page specifications apply to the entire source program. These global specifications allow a programmer to declare the source-program code page or code pages just once. These specifications then apply to all constants containing a request for conversion to Unicode. Local code page specifications apply to all subsequent source-program statements. These local specifications allow the programmer to create groups of statements containing Unicode conversion requests, all of which use the same code page or code pages for their source-character encodings. Code page specifications that apply to individual constants allow a very detailed level of control over the source data encodings to be used for Unicode conversion. The conversion of source data to Unicode may be implemented natively in the translator (assembler, compiler, or interpreter), in which case the translator recognizes and parses the complete syntax of the statement in which the constant or constants are specified, and performs the requested conversion.
Alternatively, an external function may be invoked through a variety of source language syntaxes; the function parses as little or as much of the source statement as its implementation provides, and returns the converted value for inclusion in the generated machine language of the object program. As a further alternative, the conversion may be provided by the translator's macro instruction definition facility.
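The three scopes of code page specification can be illustrated with a short sketch. The statement representation, the function name `resolve_code_pages`, and the code page numbers are hypothetical, and the precedence shown (a constant-specific modifier overrides a local specification, which overrides the global option) is an assumption about how the scopes interact rather than a statement of the translator's actual rules.

```python
def resolve_code_pages(statements, global_code_page):
    """Return (constant_text, effective_code_page) for every constant.

    statements is a list of hypothetical source statements:
      ("local", code_page)          - applies to all subsequent statements
      ("const", (text, code_page))  - code_page is None when the constant
                                      carries no modifier of its own
    """
    local_code_page = None
    resolved = []
    for kind, value in statements:
        if kind == "local":
            local_code_page = value
        elif kind == "const":
            text, own_code_page = value
            if own_code_page is not None:
                effective = own_code_page      # constant-specific scope
            elif local_code_page is not None:
                effective = local_code_page    # local scope
            else:
                effective = global_code_page   # global scope
            resolved.append((text, effective))
        else:
            raise ValueError(f"unknown statement kind: {kind}")
    return resolved


# Usage with illustrative code page numbers.
program = [
    ("const", ("first", None)),    # falls back to the global option
    ("local", 500),
    ("const", ("second", None)),   # uses the local specification
    ("const", ("third", 1047)),    # uses its own modifier
]
result = resolve_code_pages(program, global_code_page=37)
```

Here `result` pairs each constant with the code page that would govern its conversion under the assumed precedence.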
One aspect of a preferred embodiment of the present invention provides for the specification of the types of constants whose character values are to be converted to Unicode.
Another aspect of a preferred embodiment of the present invention provides for the specification of which code page or pages are used for specifying the character encodings used in the source program for writing the character strings to be converted to Unicode.
Another aspect of a preferred embodiment of the present invention performs conversions from SBCS, mixed SBCS/DBCS, and pure DBCS character strings to Unicode.
Another aspect of a preferred embodiment of the present invention provides a syntax suitable for specifying character data conversion from SBCS, mixed SBCS/DBCS, and pure DBCS representations to Unicode utilizing an extension to the conventional constant subtype notation.
Another aspect of a preferred embodiment of the present invention converts a nominal value data to Unicode using currently relevant SBCS and DBCS code pages as specified by a level or scope.
Another aspect of a preferred embodiment of the present invention provides a global level or scope comprising a global code page specification which applies to an entire source program.
Another aspect of a preferred embodiment of the present invention provides a local level or scope comprising a local code page specification which applies to all subsequent source-program statements.
Another aspect of a preferred embodiment of the present invention provides an individual constant level or scope comprising a code page specification that applies to an individual constant.
A preferred embodiment of the present invention has the advantage of providing ease of Unicode data creation: data can be entered into a program using familiar and customary techniques, and in the user's own language and preferred character sets, without having to know any details of SBCS, DBCS, or Unicode character representations or encodings.
A preferred embodiment of the present invention has the further advantage of providing an ability to handle multiple single-byte and double-byte input data encodings, each specific to a national language or a national alphabet. Such input data may be written in several convenient forms, such as SBCS, mixed SBCS/DBCS, and pure DBCS.
A preferred embodiment of the present invention has the further advantage of providing a variety of scopes for specifying controls over source data representations and encodings, such that the user has complete control over the range of these specifications, ranging from global (applying to all requested conversions in the entire program), local (applying to a range of statements containing data to be converted) to individual or constant-specific (applying to a single instance of data to be converted).
A preferred embodiment of the present invention has the further advantage of providing an open-ended design allowing easy addition of supported character sets, by simply providing additional Mapping Tables for each supported character set, and without any need to modify the internal logic of the translator (assembler, compiler, or interpreter) to be cognizant of such added character sets and tables.
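The open-ended, table-driven design can be sketched as a registry keyed by CCSID, where supporting a new character set means registering one more table while the conversion routine stays unchanged. The names `MAPPING_TABLES`, `register_table`, and `convert` are hypothetical, the table for CCSID 37 is built from Python's `cp037` codec for convenience, and the CCSID 99999 table is invented for the example.

```python
# Registry of single-byte mapping tables: CCSID -> {byte value: Unicode character}.
MAPPING_TABLES = {}


def register_table(ccsid: int, table: dict) -> None:
    """Add support for a character set by supplying its mapping table."""
    MAPPING_TABLES[ccsid] = table


def convert(data: bytes, ccsid: int) -> str:
    """Convert single-byte data to Unicode using only the registered table.
    The routine contains no knowledge of any particular character set."""
    table = MAPPING_TABLES[ccsid]
    return "".join(table[byte] for byte in data)


# A table derived from Python's cp037 codec (an assumption of this sketch).
register_table(37, {b: bytes([b]).decode("cp037") for b in range(256)})

# An invented table for a made-up CCSID, added without touching convert().
register_table(99999, {0x01: "\u03a9"})
```

With these registrations, `convert(b"\xc1\xc2", 37)` uses the first table and `convert(b"\x01", 99999)` uses the second, with no change to the conversion logic.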
A preferred embodiment of the present invention has the further advantage of having no dependence on operating system environments or run-time conversion services, which may or may not be available in the environment in which character data in the source programs are being converted to Unicode and translated to machine language.
A preferred embodiment of the present invention has the further advantage of providing a special language syntax specifying constants to be converted to Unicode, creating no conflicts with existing applications. This syntax is also a natural and intuitively familiar extension of the existing syntax for specifying character constants.
A preferred embodiment of the present invention has the further advantage of having no need to prepare or accept programs written using Unicode characters, and no need for special Unicode-enabled input/output devices or mapping software, because of the ease of data creation and the variety of data formats described above.
A preferred embodiment of the present invention has the further advantage of providing an ability to implement conversions in multiple ways to provide flexibility, including implementations in the translator itself ("native" implementation), or by using macro or preprocessor instructions, or by utilizing the translator's support for externally-defined and externally-written functions.
A preferred embodiment of the present invention has the further advantage of providing an ability to support normal sixteen-bit Unicode and Unicode UTF-8 character formats as the results of converting any of the source data formats described above.
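The difference between the two output formats can be seen by encoding the same converted characters both ways. This sketch uses Python's standard `utf-16-be` and `utf-8` codecs; the function name `unicode_bytes_for` and the choice of big-endian byte order for the sixteen-bit form are assumptions made for the example.

```python
def unicode_bytes_for(text: str, output_format: str) -> bytes:
    """Encode already-converted characters in the requested Unicode format."""
    if output_format == "utf-16":
        return text.encode("utf-16-be")  # two bytes per character here
    if output_format == "utf-8":
        return text.encode("utf-8")      # one to three bytes per character here
    raise ValueError(f"unsupported output format: {output_format}")


# Usage: a Latin letter, an accented letter, and an ideograph.
sample = "A\u00e9\u65e5"
as_utf16 = unicode_bytes_for(sample, "utf-16").hex()  # "004100e965e5"
as_utf8 = unicode_bytes_for(sample, "utf-8").hex()    # "41c3a9e697a5"
```

The sixteen-bit form uses a fixed two bytes for each of these characters, while UTF-8 uses one byte for 'A', two for the accented letter, and three for the ideograph.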