1. Technical Field
The present invention relates in general to data processing and more specifically, to a method, system and computer program product for optimization of single byte character processing employed within a multibyte character encoding scheme.
2. Description of the Related Art
Before computers, language was spoken and then written as shapes on paper. An individual recognizes these shapes and is able to reproduce the sounds and hence the meaning of the message. With the advent of computers, it was necessary to represent characters as codes inside the computer's memory so that text could be stored and reproduced. Early computers structured their memory into chunks (now commonly known as "bytes") and each of these chunks was used to represent a character. However, different computers used different character encoding schemes, making it difficult to exchange information between computers.
A standard was established by the computer industry for the length of a byte to be eight bits. The eight-bit byte allowed for 256 different characters to be encoded, sufficient to handle both upper case and lower case English. A widely adopted encoding that fits within a byte is the American Standard Code for Information Interchange (ASCII), which defines 128 characters and leaves the remaining byte values available for extensions.
With the spread of computers worldwide, it has become a requirement that computers, when handling text, are able to recognize and manipulate text in different languages. However, most computers are utilized in one language, resulting in a different character set for each of the languages. This works well for single language machine use. However, when information is shared or sent from machines employing disparate languages, a problem arises in that the same code point, i.e., character encoding, may represent different characters in different character sets.
To resolve the problem of having the same code point representing different characters in different character sets, multibyte character encoding schemes, such as "Unicode" and "Multicode," were established. For example, Unicode uses a sixteen bit character encoding system designed to support text written in diverse human languages. The multibyte character encoding schemes are designed to allow computer systems to exchange text information unambiguously because each code point represents a unique character.
Presently, however, it is not common to have font sets that are capable of displaying all the possible characters in a multibyte character encoding scheme. Typically, most fonts for western languages, such as English, will display only the first 256 out of a total of 65,536 sixteen bit Unicode code points. Thus, if an application program desires to use characters from multiple languages in one sentence, or intends to display a data string that contains, e.g., English text with mathematical symbols, a single font, such as Times Roman, cannot display all the characters. To overcome this limitation, Sun Microsystems, Inc. has introduced the concept of multiple host fonts into its Java virtual machines (JVMs). To illustrate how the concept of multiple host fonts is implemented, consider the following excerpted portion of a font property file:
dialog.0=Arial,ANSI_CHARSET
dialog.1=WingDings,SYMBOL_CHARSET,NEED_CONVERTED
dialog.3=Symbol,SYMBOL_CHARSET,NEED_CONVERTED
#Exclusion Range info,
exclusion.dialog.0=0100-20ab,20ad-ffff
Within the font property file, Java Font Dialog is mapped into a series of host fonts, namely Arial, WingDings and Symbol (in the above illustration). For characters that cannot be handled by a particular font, e.g., mathematical symbols using Arial, exclusion ranges of that particular font are specified. For example, in the above illustration, the exclusion ranges for Arial are 0x0100-0x20ab and 0x20ad-0xffff, while no exclusion ranges are specified for the fonts WingDings and Symbol. Additionally, each font is associated with a converter that is used to map each character, e.g., a Unicode character, into bytes understood by the underlying encoding scheme. The general scheme is that if a Unicode character cannot be handled, i.e., is not supported, by the first host font, the second host font specified is tried, and so on. The goal is to support as much of the multibyte character set as desired.
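The fallback order described above can be sketched in Java as follows. This is a minimal illustration only: the class name `HostFontSelector` and its methods are hypothetical and are not taken from any JVM source, the tables simply mirror the excerpted property file, and the per-font converter check is deliberately left out, so a font is chosen on exclusion ranges alone.

```java
// Hypothetical sketch of host font fallback driven by exclusion ranges.
// Names are illustrative assumptions; the converter check is omitted.
final class HostFontSelector {
    // Host fonts in the order they are tried, as listed in the excerpt.
    static final String[] FONT_NAMES = {"Arial", "WingDings", "Symbol"};

    // EXCLUSIONS[i] holds inclusive {low, high} pairs for FONT_NAMES[i].
    // Only Arial has exclusion ranges in the excerpt.
    static final int[][][] EXCLUSIONS = {
        {{0x0100, 0x20ab}, {0x20ad, 0xffff}},
        {},
        {}
    };

    /** Returns true if c falls inside any exclusion range of the given font. */
    static boolean isExcluded(int fontIndex, char c) {
        for (int[] range : EXCLUSIONS[fontIndex]) {
            if (c >= range[0] && c <= range[1]) {
                return true;
            }
        }
        return false;
    }

    /** Returns the first host font that does not exclude c, or null if none. */
    static String selectFont(char c) {
        for (int i = 0; i < FONT_NAMES.length; i++) {
            if (!isExcluded(i, c)) {
                return FONT_NAMES[i];
            }
        }
        return null;
    }
}
```

Under these assumptions, an ASCII letter resolves to the first host font, while a character inside one of Arial's exclusion ranges falls through to the next font in the list.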
However, the introduction of multiple host fonts causes a significant performance degradation, in particular when drawing text strings. In this case, two checks must be performed for every character in the string. First, it must be determined whether the character is an excluded character for the specified font. Second, if it is not an excluded character, it must be determined whether the character can be mapped into an underlying encoding scheme, such as ISO Latin-1. Because they are repeated for every character, these checks have become an expensive part of text drawing in terms of the time required.
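To make the cost concrete, the per-character approach might look like the sketch below. It is an assumption-laden illustration rather than the code of any actual JVM: the class `PerCharacterCheck` and its methods are hypothetical, the exclusion ranges are those of the first host font in the excerpt, and the standard `CharsetEncoder.canEncode(char)` call stands in for the font's converter.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the conventional approach: two checks per character.
final class PerCharacterCheck {
    // Inclusive exclusion ranges of the first host font in the excerpt.
    static final int[][] EXCLUSIONS = {{0x0100, 0x20ab}, {0x20ad, 0xffff}};

    /** Check 1: is the character inside an exclusion range of the font? */
    static boolean isExcluded(char c) {
        for (int[] range : EXCLUSIONS) {
            if (c >= range[0] && c <= range[1]) {
                return true;
            }
        }
        return false;
    }

    /** Returns true only if every character passes both checks. */
    static boolean drawableWithFirstFont(String text) {
        // Assumption: ISO Latin-1 is the underlying encoding scheme.
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isExcluded(c)) {
                return false;                 // check 1 failed
            }
            if (!latin1.canEncode(c)) {
                return false;                 // check 2 failed
            }
        }
        return true;
    }
}
```

Both checks run once per character, so the work grows with the length of the string even when every character is plain ASCII.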
It is an object of the present invention to provide a method, system and computer program product for optimizing processing of single byte characters within a multibyte character encoding scheme.
To achieve the foregoing object, and in accordance with the invention as embodied and broadly described herein, a method, system and computer program product for optimization of single byte character processing within a multibyte character encoding scheme is disclosed. The method includes: (1) receiving a data string, (2) passing the data string in its entirety to a first processing routine and (3) thereafter evaluating the data string to determine if any character in the data string is an excluded character of a host font. The method further includes (4) transferring the data string in its entirety to a second processing routine and (5) assessing a limited number of characters in the data string to determine if the data string can be converted under an underlying encoding scheme.
The present invention recognizes that, in a multiple font environment, the conventional method of examining each individual character in a data string to determine if the character is an excluded character for a given font, and then determining if the character can be mapped, i.e., converted, into the underlying encoding scheme, is usually unnecessary for data strings that contain only ASCII characters. Typically, only the first host font is required, because in most cases ASCII characters are not excluded characters and can be mapped into the underlying encoding scheme. The present invention therefore introduces the broad concept of examining the attributes of the data string as a whole, or of a limited number of characters within the data string, within one function call, as opposed to examining each character separately with multiple function calls. This approach facilitates efficient processing, within a multiple host font environment, of data strings that employ multibyte character encoding but contain only single byte ASCII characters.
In one embodiment of the present invention, the step of receiving a data string further includes inputting the data string from a data entry device. It should be readily apparent to those skilled in the art that any device that is capable of inputting data, such as a computer keyboard, is well within the intended scope of the present invention.
In another embodiment of the present invention, the step of passing further comprises transmitting the data string to the first processing routine as an array. It should be well understood by those skilled in the art that arrays encompass data representations such as table or matrix structures.
In yet another embodiment of the present invention, the step of transferring further comprises transmitting the data string to the second processing routine as an array. As discussed above, it should be well understood by those skilled in the art that arrays encompass data representations such as table structures.
In another embodiment of the present invention, the step of evaluating further includes combining all characters in the data string using the Boolean OR operation to form a single character, determining if the resulting single character is a single byte character, and checking the host font's exclusion ranges to determine if the exclusion ranges contain a single byte character.
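One way to read this embodiment is sketched below in Java. The sketch rests on stated assumptions: the class `OrFoldCheck` and its method names are hypothetical, a "single byte character" is taken to mean a sixteen bit value whose high byte is zero, and exclusion ranges are modeled as inclusive {low, high} pairs. Because OR can only set bits, a zero high byte in the combined value implies a zero high byte in every character of the string.

```java
// Hypothetical sketch of the OR-based whole-string evaluation.
final class OrFoldCheck {
    // Inclusive exclusion ranges of the first host font in the excerpt.
    static final int[][] ARIAL_EXCLUSIONS = {{0x0100, 0x20ab}, {0x20ad, 0xffff}};

    /** Combines all characters with bitwise OR into a single character. */
    static char orAll(char[] chars) {
        int acc = 0;
        for (char c : chars) {
            acc |= c;
        }
        return (char) acc;
    }

    /** True if any exclusion range contains a value in 0x00-0xff. */
    static boolean rangesContainSingleByte(int[][] exclusions) {
        for (int[] range : exclusions) {
            if (range[0] <= 0xff) {   // low end reaches into single byte values
                return true;
            }
        }
        return false;
    }

    /**
     * True if the whole string can skip per-character exclusion checks:
     * every character is single byte and no exclusion range covers
     * any single byte value.
     */
    static boolean fastPathApplies(char[] chars, int[][] exclusions) {
        boolean allSingleByte = (orAll(chars) & 0xff00) == 0;
        return allSingleByte && !rangesContainSingleByte(exclusions);
    }
}
```

A false result from `fastPathApplies` does not mean the string cannot be drawn; it only means the shortcut is unavailable and the per-character checks would still be needed.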
The foregoing description has outlined, rather broadly, preferred and alternative features of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features of the invention will be described hereinafter that form the subject matter of the claims of the invention. Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiment as a basis for designing or modifying other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.