The present invention relates to coded character sets. More particularly, the present invention provides methods and apparatus for mapping a coded character set to a second coded character set associated with character attributes. The frame of reference for the present invention is a system that accesses character attributes.
The use of different character coding sets in various software environments has caused incompatibility between computer systems and code ambiguity. Different coding sets are required to represent text, mathematical, scientific, and musical symbols. Specialized character coding sets are needed, for example, to represent Chinese or Japanese characters. Furthermore, codes used to represent one character or symbol in a particular coding set often represent a different character or symbol in another coding set. For example, some codes may represent the first byte of a two-byte ideograph in a different coding set.
The growth of the Internet and the need for software that can be used in different environments and platforms have created a push for universal character coding sets. These universal coding sets contain a character set standard that can be used in many different software environments. One example of such a universal coding set is Unicode. Unicode allows assignment of characters to codes ranging from code 0x00 to code 0x10FFFF. The coding space under this definition allows Unicode to represent 1,114,112 different characters. Not surprisingly, however, many of the codes allocated in Unicode are not assigned. Unicode is described in ISO/IEC 10646-1, which is hereby incorporated by reference for all purposes. Aspects of Unicode are also described in the Unicode Technical Standard #6, available from Unicode Inc., and in Bits of Unicode by Mark Davis, available from the Unicode Consortium.
Each character in Unicode, and in other universal coding sets, has a character code. Every character code is associated with a set of character attributes. Character attributes include collation weight, whether the character is printable, whether the character is upper or lower case, which character class the character belongs to, etc. The attributes associated with a character are accessed frequently. For example, when a user types a letter “b” into a computer system, the computer system examines the attributes associated with the character code for “b” to determine whether the character should be displayed on the screen. In another example, when a sort function is used to alphabetize a list of words, the attributes for each character in the list of words are examined to determine how the words are sorted alphabetically.
In Unicode, each character attribute set usually requires approximately 64 bytes of memory. Consequently, a system associating each allocated character code in Unicode with a character attribute set requires 1,114,112 times 64 bytes, or 71,303,168 bytes, of memory space. Due to this large memory requirement, many computer systems attempt to compress the Unicode data, since many of the 1,114,112 possible character codes and character attribute sets are not used. By compressing this data, significant memory space is saved. However, performing decompression and compression each time a character attribute is accessed can be very inefficient. Other numeric mapping schemes can also consume valuable processing resources or additional memory space.
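The memory figure above can be checked with a few lines of arithmetic. The following is only an illustrative sketch, assuming the code space runs from 0x00 through 0x10FFFF inclusive and taking the approximate 64-byte attribute set size at face value; the constant names are chosen here for illustration and are not part of the described system.

```python
# Illustrative check of the uncompressed attribute table size.
# Assumption: the code space covers 0x00 through 0x10FFFF inclusive.
CODE_SPACE = 0x10FFFF + 1

# Approximate size of one character attribute set, as stated in the text.
ATTRIBUTE_SET_BYTES = 64

# One attribute set per possible character code, with no compression.
uncompressed_bytes = CODE_SPACE * ATTRIBUTE_SET_BYTES

print(CODE_SPACE)          # 1114112
print(uncompressed_bytes)  # 71303168
```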
Each of the currently available techniques for mapping or compressing character code sets has disadvantages with regard to at least some of the desirable characteristics of accessing character attributes. It is therefore desirable to provide a system for mapping a character coding set (such as Unicode) to an optimized character coding set in which the mapping system exhibits desirable characteristics as well as or better than the technologies discussed above.
According to the present invention, methods and apparatus are provided to map a character coding set to an optimized character coding set associated with an attribute set.
A system identifies a character code. This character code may be received from keyboard entry, read from memory, or acquired from an external network, for example. This character code comprises an arrangement of bytes. According to specific embodiments, each byte can be identified as a group, plane, row, or cell. The row is mapped to a corresponding row of an optimized character code. The group, plane, or cell of the character code and the optimized character code can be the same. Optionally, the plane, group, and cell are mapped to corresponding planes, groups, and cells of the optimized character code.
Each of the groups, planes, rows, and cells of character codes and optimized character codes can be a value identified by a particular arrangement of bits. In Unicode, the value of each group, plane, row, or cell is equivalent to one byte in a character code. Alternatively, the group, plane, row, or cell can be a value identified by any arrangement of bits in the character code and can be mapped to a different arrangement of bits in the optimized character code.
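As a minimal sketch of the one-byte-per-field arrangement described above, the following splits a character code into its group, plane, row, and cell. It assumes a 32-bit code in which the group is the most significant byte and the cell is the least significant byte; the names CodeFields and split_code are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class CodeFields(NamedTuple):
    """Hypothetical container for the four one-byte fields of a code."""
    group: int
    plane: int
    row: int
    cell: int


def split_code(code: int) -> CodeFields:
    """Split a 32-bit character code into group, plane, row, and cell.

    Assumption: group is the most significant byte, cell the least.
    """
    if not 0 <= code <= 0xFFFFFFFF:
        raise ValueError("character code must fit in 32 bits")
    return CodeFields(
        group=(code >> 24) & 0xFF,
        plane=(code >> 16) & 0xFF,
        row=(code >> 8) & 0xFF,
        cell=code & 0xFF,
    )


# The highest Unicode code, 0x10FFFF, lies in group 0x00, plane 0x10,
# row 0xFF, cell 0xFF under this layout.
fields = split_code(0x10FFFF)
print(fields)
```

Other embodiments could assign the fields to different bit arrangements, in which case only the shift and mask values in this sketch would change.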
One aspect of the invention provides a method for mapping character codes to optimized character codes associated with character attributes. The method may be characterized by the following sequence: (1) receiving a character code having a string of bits; (2) identifying a first subset of bits in the character code, wherein the first subset of bits identifies a first row; and (3) mapping the first row to a second row associated with an optimized character code in an optimized character code index, wherein mapping the first row identifies an optimized character code for the received character code.
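The three steps above can be sketched as follows. This is an illustrative sketch only, not the claimed implementation: it assumes a 16-bit character code whose upper byte is the row and whose lower byte is the cell, and it uses a hypothetical row index (ROW_INDEX) in which only three rows are in use. The cell is carried over unchanged, and the used rows are renumbered densely so that an attribute table indexed by the optimized code needs entries only for rows that are actually used.

```python
from typing import Optional

# Hypothetical optimized character code index: original row -> dense row.
# Only rows that contain assigned characters appear here.
ROW_INDEX = {0x00: 0, 0x04: 1, 0x30: 2}


def to_optimized(code: int) -> Optional[int]:
    """Map a 16-bit character code to an optimized character code.

    Returns None when the row of the code is not in the index.
    """
    # Step 2: identify the subset of bits that forms the row.
    row = (code >> 8) & 0xFF
    cell = code & 0xFF

    # Step 3: map the first row to the second (optimized) row.
    new_row = ROW_INDEX.get(row)
    if new_row is None:
        return None

    # The cell is kept as-is, so the optimized code is the new row
    # followed by the original cell.
    return (new_row << 8) | cell


print(hex(to_optimized(0x0062)))  # 0x62
print(hex(to_optimized(0x0410)))  # 0x110
print(hex(to_optimized(0x3042)))  # 0x242
print(to_optimized(0x0500))       # None
```

With this hypothetical index, an attribute table covering three rows of 256 cells needs 768 entries rather than one entry for every possible 16-bit code.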
A second subset of bits in the character code can be identified as a first plane. The first plane can be mapped to a second plane associated with an optimized character code.
Another aspect of the invention provides an apparatus for mapping character codes to optimized character codes. The apparatus may be characterized by the following features: (1) memory; (2) an input mechanism for receiving a character code; and (3) one or more processors coupled with the memory, the processors configured to identify a first subset of bits in the character code, wherein the first subset of bits identifies a first row, and to map the first row to a second row associated with an optimized character code in an optimized character code index, wherein mapping the first row identifies an optimized character code for the received character code.
The one or more processors can be further configured to identify a second subset of bits in the character code, wherein the second subset of bits identifies a first plane. The one or more processors can also map the first plane to a second plane associated with an optimized character code in the optimized character code index.
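Mapping both the plane and the row can be sketched in the same way. The sketch below is an assumption-laden illustration rather than the described apparatus: it treats the group as zero, uses a hypothetical plane index (PLANE_INDEX) with two planes in use, and keeps a separate hypothetical row index per original plane (ROW_INDEX_BY_PLANE).

```python
from typing import Optional

# Hypothetical indexes: only two planes, and a few rows in each, are in use.
PLANE_INDEX = {0x00: 0, 0x01: 1}
ROW_INDEX_BY_PLANE = {
    0x00: {0x00: 0, 0x04: 1},
    0x01: {0xD4: 0},
}


def map_plane_and_row(code: int) -> Optional[int]:
    """Map the plane and row of a character code; keep the cell unchanged.

    Assumption: the group byte is zero and is therefore ignored.
    Returns None when the plane or the row is not in the indexes.
    """
    plane = (code >> 16) & 0xFF
    row = (code >> 8) & 0xFF
    cell = code & 0xFF

    # Map the first plane to the second (optimized) plane.
    new_plane = PLANE_INDEX.get(plane)
    if new_plane is None:
        return None

    # Map the first row to the second row within that plane.
    new_row = ROW_INDEX_BY_PLANE.get(plane, {}).get(row)
    if new_row is None:
        return None

    return (new_plane << 16) | (new_row << 8) | cell


print(hex(map_plane_and_row(0x1D400)))  # 0x10000
print(hex(map_plane_and_row(0x0410)))   # 0x110
print(map_plane_and_row(0x20000))       # None
```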
Another aspect of the invention pertains to computer program products including a machine-readable medium on which are stored program instructions, tables or lists, and/or data structures for implementing a method as described above. Any of the methods, tables, or data structures of this invention may be represented as program instructions that can be provided on such computer-readable media.
A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.