1. Field of the Invention
The present invention relates generally to computer systems, and more specifically to user-defined codeset conversions, including operation-based code conversions.
2. Related Art
Computer systems communicate on many different computing platforms. The popularity of the Internet has led to an increased need to transfer documents between these platforms. Difficulties can arise, however, because differing platforms represent character data differently. An overview of character sets and codesets is helpful to understand these difficulties.
A character set is a collection of predefined characters based on the needs of a particular language or use environment. The character set can be composed of alphabetic, numeric, or other characters. Characters may be grouped in a set because they are needed to communicate in a given language or in a given specialized environment. Examples of character sets of the latter type include the symbols necessary to communicate mathematical or chemical formulas.
Once a character set is chosen, a remaining issue is how to represent that character set in a computer system. Collectively, the representation of all the characters of the character set is referred to as a codeset. The codeset defines a set of unambiguous rules that establish a one-to-one relationship between each character of the character set and that character's bit representation. This bit representation can be considered as a graphical image of the character. It is this image that is displayed on the computer screen and the printed page.
The codeset's representation can be dependent on the number of bytes used to represent each character, as well as the computer system's communication protocol. For example, a computer system using a 7-bit communication protocol often represents the same set of characters differently than a computer system using an 8-bit protocol. Thus, the choice of which codeset to use frequently depends on the user's data processing requirements.
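The difference between a 7-bit and an 8-bit representation can be illustrated with a short sketch. It assumes Python's standard `euc_jp` and `iso2022_jp` codecs as stand-ins for an 8-bit and a 7-bit codeset; the sample character and the helper name `is_seven_bit` are chosen only for illustration.

```python
# Encode the same character under two codesets using Python's standard codecs.
ch = "\u3042"  # HIRAGANA LETTER A, used here only as a sample character

euc = ch.encode("euc_jp")      # 8-bit codeset: bytes have the high bit set
iso = ch.encode("iso2022_jp")  # 7-bit codeset: escape sequences switch character sets


def is_seven_bit(data: bytes) -> bool:
    """Return True when every byte fits in 7 bits (hypothetical helper)."""
    return all(b < 0x80 for b in data)


print(euc.hex(), is_seven_bit(euc))
print(iso.hex(), is_seven_bit(iso))
```

The two byte sequences differ even though both represent the same character, which is why data exchanged between such systems must be converted.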
For these reasons, two computer systems may employ different codesets even when working in the same language. Consider, for example, the Japanese Industrial Standard character set. This character set can be encoded in a variety of codesets, including (1) standardized codesets such as eucJP or ISO-2022-JP, (2) a user-defined subset of a standardized codeset, or (3) any other non-standard user-defined codeset. There will come a time, however, when these two computer systems need to exchange data. When this occurs, the data must be converted to the receiving computer's native codeset. In the past, this has been done using one of two approaches: one highly primitive and one highly complex.
The first conversion technique required the user to request a conversion by defining a series of simple one-to-one mappings. These mapping requests were then converted to a binary table. This technique was onerous because it required every mapping to be listed individually. For multi-byte codesets, this technique was particularly problematic: the conversions often could not be expressed in terms of one-to-one mappings. Even when they could, the resulting tables were too large for practical use.
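A minimal sketch of such a one-to-one mapping conversion follows. The table contents, the fallback byte, and the names `ONE_TO_ONE`, `convert_one_to_one`, and `worst_case_entries` are hypothetical and serve only to show why listing every mapping individually scales poorly.

```python
from typing import Dict

# Hypothetical one-to-one table: every source byte must be listed individually.
ONE_TO_ONE: Dict[int, int] = {0x41: 0x61, 0x42: 0x62, 0x43: 0x63}


def convert_one_to_one(data: bytes, table: Dict[int, int], fallback: int = 0x3F) -> bytes:
    """Map each input byte through the table; unlisted bytes become the fallback."""
    return bytes(table.get(b, fallback) for b in data)


def worst_case_entries(bytes_per_char: int) -> int:
    """Upper bound on entries in a dense table indexed by a code of this width."""
    return 256 ** bytes_per_char


print(convert_one_to_one(b"ABC", ONE_TO_ONE))
print(worst_case_entries(1), worst_case_entries(2), worst_case_entries(3))
```

Under these assumptions a dense table grows from 256 entries for single-byte codes to 65,536 for two-byte codes and 16,777,216 for three-byte codes.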
The second conversion technique required the user to write algorithmic converters to a defined application program interface. These converters were typically written in the C programming language, or some other compiled language. As such, a compiler, linker, and debugger were needed to test and verify such converters. This complicated system often led to lengthy development cycles. This invention provides a valuable alternative.
Embodiments of the present invention provide methods and systems that facilitate codeset conversions. In a preferred embodiment, a user-defined text file assigns conversion rules between differing codesets. This text file is composed of one or more conditional, or operation-based, conversion elements. A utility evaluates the rules represented by these elements and produces a table that memorializes the rules in a binary file format. This binary file is then used to transform character data of a first codeset to character data of a second codeset. In this way, the preferred embodiment converts data between differing codesets efficiently. In fact, the preferred embodiment can convert data between differing multi-byte codesets and between multi- and single-byte codesets. The preferred embodiment does not require the writing of complex algorithmic converter functions and is not limited to primitive one-to-one mapping requests. Nor is the user required to compile the text file, as it is passed to the utility as a text file.
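The flow described above (rule text, utility, binary table, conversion) can be sketched as follows. This is an illustration under stated assumptions and not the format used by the preferred embodiment: the rule syntax, the operation names `map` and `add`, the four-byte record layout, and the functions `compile_rules` and `convert` are all hypothetical, and the sketch handles single-byte data only.

```python
import struct

# Hypothetical rule text: <low>[-<high>] <operation> <argument>
RULES_TEXT = """\
# lines starting with '#' are comments (assumed syntax)
0x41-0x5A add 0x20
0x80 map 0x3F
"""

OPCODES = {"map": 0, "add": 1}  # assumed operation codes


def compile_rules(text: str) -> bytes:
    """Turn the rule text into a binary table: a 2-byte count, then 4-byte records."""
    records = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        span, op, arg = line.split()
        lo, _, hi = span.partition("-")
        low = int(lo, 16)
        high = int(hi, 16) if hi else low
        records.append(struct.pack("BBBB", low, high, OPCODES[op], int(arg, 16)))
    return struct.pack("<H", len(records)) + b"".join(records)


def convert(data: bytes, table: bytes) -> bytes:
    """Apply the first matching rule to each byte; unmatched bytes pass through."""
    (count,) = struct.unpack_from("<H", table, 0)
    rules = [struct.unpack_from("BBBB", table, 2 + 4 * i) for i in range(count)]
    out = bytearray()
    for b in data:
        for low, high, opcode, arg in rules:
            if low <= b <= high:
                b = (b + arg) % 256 if opcode == OPCODES["add"] else arg
                break
        out.append(b)
    return bytes(out)


table = compile_rules(RULES_TEXT)
print(len(table))
print(convert(b"ABC\x80!", table))
```

In this sketch a single range rule covers 26 input codes that a one-to-one request would have to list individually, and the rule text is handed to the utility as plain text without a separate compile step by the user.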