The invention relates, in general, to methods and systems used for the computer processing of text, and more specifically, to the composing and decomposing of text represented according to the Unicode Standard in a computer system.
Computer systems required to process text information may use an international standard for coding text. The accepted standard for international coded text information is the Unicode Standard, published by Unicode, Inc. According to the Unicode Standard, “text” refers to alphabetic characters as well as punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, etc. The Unicode Standard, Version 2.0 and subsequent versions and revisions thereto, provides the capacity to encode all the characters used for the major written languages of the world and is incorporated herein by reference. For example, Unicode scripts include Latin, Greek, Armenian, Hebrew, Arabic, Bengali, Thai, Japanese kana, a unified set of Chinese, Japanese, and Korean ideographs, as well as many other scripts. The Unicode Standard provides codes for nearly 39,000 characters from the world's alphabets, symbol collections, and ideograph sets. Left unused for future expansion are 18,000 codes, while over 6,000 codes are reserved for private use. The private use codes are intended to be system or application specific and can be defined by those developing their own system or application.
The Unicode Standard is based on a 16 bit code set that provides codes for more than 65,000 characters, whereby each character is identified by a unique 16 bit value. In fact, there are 65,536, i.e. 2 to the sixteenth power, possible values inherent in a 16 bit word. The code values of the Unicode Standard are equivalent to the code values of the “Universal Character Set” in two-octet form (UCS-2), which is a subset of ISO/IEC 10646. The full code set of ISO/IEC 10646 is called the Universal Character Set in four-octet form (UCS-4). Unicode does not use complex modes or escape codes for constructing or representing characters and thus is a simplified and straightforward approach to representing characters.
The Unicode Standard is based on three underlying premises. The first premise is that the standard must define the smallest useful elements of text being coded. The second premise is that a unique character code must be assigned to each element. Finally, the third premise is that basic rules for encoding and interpreting text must be provided so that programs can successfully read and process the coded text. When defining elements of text for a given language, it must be determined what the smallest textual elements of the language are which are used to create words and sentences. For example, the smallest textual elements would be single graphical elements in many languages. But in other languages, the smallest textual elements may be multiple graphical elements, such as in Devanagari.
Regardless of the language, the smallest textual elements are represented in Unicode as “code elements”. Code elements serve as the building blocks for Unicode “characters”, wherein a Unicode “character” may be an element itself, e.g. “u”, a combination of text elements, e.g., “ü”, or, to a much lesser extent, a symbol, e.g. “*”. For the most part, code elements correspond to the most commonly used text elements. For example, each upper case and lower case letter in the English alphabet is represented by a single code element. As a result, coding of elements under the Unicode Standard remains straightforward with a single value for each element. Where appropriate, the Unicode Standard also defines codes for the presentation of text. For instance, some codes control the direction in which text is written, whether left to right or right to left, including the rare cases where text must change direction within a single run of script. Also, the Unicode Standard defines explicit characters for line and paragraph endings, but the large majority of codes represent text or code elements.
Typically, interpretation of text by a computer system is accomplished as the text is being processed. For example, consider the case where a user is typing on a computer system using a word processor application. When the computer operator depresses a key or key combination, for example “shift and d”, the computer system receives a signal or message that the “shift” and “d” keys were simultaneously pressed at the keyboard. This message is encoded by the computer system as a Unicode Standard code. An application, e.g., a word processor, stores the code in memory and also passes it on to the display software for rendering the character on the screen. The display software processes the code and displays the letter “D”; this process continues as typing continues.
While the Unicode Standard directly addresses encoding and interpreting of text for presentation, it does not address many other actions performed on the text related to presentation or the application itself. For example, the standard does not address issues such as spell checking, which is left to applications. Furthermore, the Unicode Standard does not address the rendering of characters on the screen, such as font and size. The representation or rendering of the character on the screen is called a “glyph”. The Unicode Standard does not define glyphs; rather, it limits itself to the code value associated with an abstract character entity, such as Latin character “b”. It is actually the software or hardware rendering engine of the computer or application program which is responsible for the appearance of the characters on the screen.
In addition, the Unicode Standard does address encoding of “composed character sequences” (CCS). CCS refers to the representation of multiple characters rendered together. For example, “â” is a composed character created by rendering an “a” and “^” together. According to the standard, a CCS is made up of a base character, occupying a single space, followed by one or more non-spacing marks to be rendered in the same space as the base character or a spacing mark to be rendered adjacent to the base character. For often used CCSs, the Unicode Standard defines a single code value to represent the common combination of characters, rather than combining a base character with a combination of other individual characters each time the common CCS is used. These are referred to as “pre-composed” characters. For example, the character “ü” can be encoded as the single code value U+00FC or as two values where the base character U+0075 represents “u” followed by the non-spacing character U+0308 which represents “¨”, expressed as “u+¨”.
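The two encodings of “ü” described above can be illustrated with a minimal sketch. It assumes Python's standard `unicodedata` module, which reflects a later version of the Unicode Standard than Version 2.0; the helper name `code_values` is hypothetical and used only for illustration.

```python
import unicodedata

# Pre-composed form: a single code value, U+00FC.
precomposed = "\u00FC"
# Composed character sequence: base "u" (U+0075) followed by the
# non-spacing diaeresis (U+0308).
sequence = "\u0075\u0308"

def code_values(text):
    """Hypothetical helper: list the code values of a string as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Composing the sequence yields the pre-composed character, and
# decomposing the pre-composed character yields the sequence.
composed_from_sequence = unicodedata.normalize("NFC", sequence)
decomposed_from_precomposed = unicodedata.normalize("NFD", precomposed)

print(code_values(precomposed))
print(code_values(sequence))
```

The two strings differ as stored code values yet represent the same character, which is the situation the equivalence rules of the standard are meant to handle.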
Decomposition of pre-composed characters is also defined by the Unicode Standard. For example, a word processor importing a text file containing a pre-composed character may decompose the character into its base character and subsequent non-spacing characters if, for some reason, this makes processing within the word processor easier or more efficient. A pre-composed character is simply a special type of CCS, whereby the pre-composed character is represented by a single predefined Unicode value.
The Unicode Standard specifies an algorithm for determining whether CCSs of Unicode are “equivalent”. The Unicode concept of equivalence facilitates the interchanging of pre-composed characters with decomposed versions of the same characters and vice versa. Pre-composed characters and character sequences are equivalent if, when fully decomposed and correctly ordered, they yield identical elements in identical sequences. The Unicode Standard algorithm decomposes pre-composed characters and then orders the resulting characters according to the Unicode rules based, in part, on each character's combining class. Elements which combine with other elements are referred to as “combining characters” and have associated with them a “combining class”. The combining class is a Unicode Standard construct whereby characters are classified based on a precedence which relates to how characters can be combined. As discussed earlier, whether a combining character is spacing or non-spacing relates to how it combines with other characters.
Within the process of decomposing a character sequence, Unicode employs a Canonical Ordering Algorithm, which aids in the performance of equivalence comparisons by determining which characters interact “typographically”. Characters interact typographically if their order plays a role in the ultimate positioning of the characters within a sequence. For example, if non-spacing characters within a sequence do not typographically interact, then sequences that differ only in the order of those characters are treated as equivalents. In practice, each Unicode combining character is assigned a numerical value indicating other combining characters with which the combining Unicode character typographically interacts. Characters of the same combining class typographically interact, whereas characters of different combining classes do not. The final result of the decomposition process is that the original pre-composed character has been transformed into what is referred to as its decomposed “normalized form”. Typically, the normalized form starts with a base character which is followed by non-spacing combining marks which are ordered within the sequence based on increasing combining class values from left to right.
Examples in this specification may use a “+” to indicate a sequence of characters. For example, “â” decomposed would be represented as “a+^”, where “a” is the base character, “+” represents that it is a sequence of characters, and “^” is a non-spacing combining mark, i.e., it occupies the same space as the letter “a”.
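The ordering by combining class described above can be sketched as follows. This is an illustrative approximation only, assuming Python's `unicodedata.combining` as the source of combining class values; the function name `canonical_order` is hypothetical. Each run of combining marks following a base character is sorted by increasing combining class, and marks of the same class keep their relative order because the sort is stable.

```python
import unicodedata

def canonical_order(text):
    """Sort each run of combining marks by increasing combining class.

    Characters with combining class 0 (base characters) stay in place
    and end the current run. Python's sorted() is stable, so marks that
    share a combining class are never reordered relative to each other.
    """
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# U+0301 (acute, class 230) and U+0323 (dot below, class 220) belong to
# different classes, so both orders reach the same normalized form.
same = canonical_order("a\u0301\u0323") == canonical_order("a\u0323\u0301")

# U+0301 and U+0300 (grave) share class 230, so their order is kept and
# the two sequences are not equivalent.
different = canonical_order("a\u0301\u0300") != canonical_order("a\u0300\u0301")
```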
From the normalized form, a CCS can also be composed, in accordance with the Unicode Standard, into a pre-composed character, represented by a single Unicode value, assuming that Unicode defines a character for the particular CCS combination.
Representing a CCS or pre-composed character in its decomposed normalized form allows comparisons to determine equivalence among two representations of similar character sequences. A determination of equivalence allows substitutions by a system or application as required by that system or application. Under the current Unicode Standard, and algorithms therein, the process of getting character sequences into normalized forms and ultimately performing comparisons to determine equivalence is often quite time consuming. Also, such process is accomplished as the characters are being requested, during runtime. Therefore, the process diverts valuable processor resources from the application being used. For example, a CCS is first broken down into all of its low level characters through a series of searches and sorts spanning, potentially, the entire Unicode database of more than 65,000 characters. Once all the characters are determined, only then can the string of characters be put into a normalized form, again by a series of sorts based on combining class values. Following normalization of each CCS under comparison, a determination of equivalence can be made.
Composing a character in accordance with the current Unicode algorithms also involves fully decomposing a character string as described above, placing it into normalized form, and then iteratively combining characters according to combining class values. It should be noted that because different non-spacing combining marks within a string can have the same combining class value, there may exist multiple valid normalized forms for a set of characters or a pre-composed character. This fact can make comparison of even normalized forms of CCSs complex and time consuming.
Accordingly, a need exists for a method and apparatus for efficiently decomposing and composing Unicode characters.
The present invention comprises apparatus and methods for efficiently decomposing and composing Unicode characters. A pre-processor accesses a known database of Unicode characters to create decomposition and composition mapping tables. The decomposition mapping table (M) comprises decomposition data for existing Unicode pre-composed characters. Two composition mapping tables are created, one for standard compositions, called the composition mapping table (MT), and one which is used to resolve ambiguous compositions, called a composition “normalized” mapping table (NMT). Ambiguous compositions occur when combining characters can be validly ordered in more than one sequence. The composition mapping table MT is derived from the decomposition mapping table M and comprises canonically equivalent CCSs for each decomposition therein. The normalized mapping table NMT is derived from the composition mapping table MT and comprises pairs of equivalent CCSs, wherein, although both CCSs are valid, one of the pair is defined as the normalized CCS. In the illustrative embodiment, the mapping tables are created only once and then stored in memory for subsequent use by the system or applications.
The decomposition mapping table (M) is created under the control of a pre-processor. A search engine within the pre-processor obtains each pre-composed character (C) and its corresponding decomposition (D) from the Unicode Standard DB. The pre-composed character C and its associated decomposition D are referred to as the decomposition “key value pair” <C,D>, which is written into mapping table M. The search engine then analyzes the sub-characters in each D to determine whether D can be further decomposed. Specifically, the search engine determines whether there is a decomposition in mapping table M for each sub-character in a given D. If sub-characters do have decompositions, then they are replaced within D with their decompositions and the result is sorted using Unicode's combining class rules to create a D′. Consequently, a new key value pair <C,D′> is created, which the writer stores in mapping table M in place of <C,D>. This process continues until all pre-composed Unicode characters are mapped into the decomposition mapping table with a maximally decomposed character sequence. Characters which do not have decompositions are not processed or added to M.
With the decomposition mapping table M created, an application, for example, may request a decomposition of a source CCS string. In this case, a runtime processor controls the determination of a decomposition for the given source CCS string. Each character (C) within the source CCS is analyzed and decomposed to create a “result string”, which stores the decomposition as it is being created. If a character C has a decomposition D in M, then D is appended to the result string. However, if C is a combining mark, it is appended to the result string and the result string is then sorted and ordered based on the Unicode combining class rules. Alternatively, if C does not have a decomposition D and is not a combining mark, C is simply appended to the result string. Because the result string is ordered as it is being created, the final result is a normalized, fully decomposed version of the original source CCS string. The runtime processor passes the resulting decomposition back to the application which requested it, and the memory associated with the result string is cleared.
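A rough sketch of such a pre-processing pass is given below. It is not the claimed implementation: it assumes Python's `unicodedata` module stands in for the Unicode Standard DB, restricts itself to 16 bit code values, and skips compatibility decompositions (those tagged with `<...>`). The names `canonical_sort` and `build_decomposition_table` are hypothetical.

```python
import unicodedata

def canonical_sort(text):
    """Stable-sort each run of combining marks by combining class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def build_decomposition_table(code_points=range(0x10000)):
    """Build M: pre-composed character C -> maximally decomposed D."""
    table = {}
    # First pass: record each key value pair <C, D> from the database.
    for cp in code_points:
        ch = chr(cp)
        raw = unicodedata.decomposition(ch)
        if raw and not raw.startswith("<"):  # canonical mappings only
            table[ch] = "".join(chr(int(h, 16)) for h in raw.split())
    # Later passes: replace sub-characters that themselves have entries
    # in M, re-sort, and store <C, D'> in place of <C, D> until stable.
    changed = True
    while changed:
        changed = False
        for c, d in list(table.items()):
            expanded = canonical_sort("".join(table.get(s, s) for s in d))
            if expanded != d:
                table[c] = expanded
                changed = True
    return table

M = build_decomposition_table()
```

For instance, U+01D6 is listed in the database as U+00FC followed by U+0304; after the second pass its entry in `M` holds the three-character sequence U+0075, U+0308, U+0304.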
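The runtime decomposition loop described above might be sketched as follows. The table `M` here is a small hand-written excerpt used only for illustration, and the names `reorder_tail` and `decompose` are hypothetical. As a simplifying assumption, the sketch also re-sorts the trailing combining marks after appending a decomposition D, which covers the case where D itself begins with a combining mark.

```python
import unicodedata

# Illustrative excerpt of a decomposition mapping table M
# (pre-composed character -> fully decomposed, ordered sequence).
M = {
    "\u00FC": "u\u0308",        # u with diaeresis
    "\u01D6": "u\u0308\u0304",  # u with diaeresis and macron
    "\u1EAD": "a\u0323\u0302",  # a with circumflex and dot below
}

def reorder_tail(result):
    """Stable-sort the run of combining marks at the end of the list."""
    i = len(result)
    while i > 0 and unicodedata.combining(result[i - 1]) != 0:
        i -= 1
    result[i:] = sorted(result[i:], key=unicodedata.combining)

def decompose(source, table):
    """Build the result string one source character at a time."""
    result = []
    for c in source:
        if c in table:
            result.extend(table[c])  # append decomposition D
            reorder_tail(result)     # assumption: keep the tail ordered
        elif unicodedata.combining(c) != 0:
            result.append(c)         # combining mark: append, then sort
            reorder_tail(result)
        else:
            result.append(c)         # plain character: append as is
    return "".join(result)

# U+00FC followed by dot below (U+0323): the dot below has the lower
# combining class, so it is ordered before the diaeresis.
decomposed = decompose("\u00FC\u0323", M)
```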
The composition mapping table (MT) is also created using the pre-processor. The search engine of the pre-processor iterates through each Unicode character C in the decomposition mapping table M, and if the decomposition D associated with C has a base character as its first character, the search engine obtains D. Keeping the base character as the first character, a combiner within the pre-processor arranges the remaining characters within the decomposition into all possible orderings. This process produces a set (S) of all possible orderings of the decomposition which could be associated with C, wherein each ordering within the set is referred to as an “element” (E). The pre-processor uses the Unicode combining class rules to determine whether each element is canonically equivalent to C, and discards those elements which are not. The pre-processor continues operating on the remaining elements. If an element includes only two sub-elements or characters, a composition “key value pair”, i.e., <E, C>, is written into the composition mapping table MT. The composition key value pair includes the original character C and exactly two sub-elements, where E=E1+E2, in the illustrative embodiment. If an element comprises more than two sub-elements, all of the sub-elements are grouped into an element E′1, with the exception of the right-most sub-element Ex. The search engine determines whether a composed character C1 exists in mapping table M which corresponds to E′1. If a C1 does exist in M, then the group of sub-elements represented by E′1 is replaced with C1. Accordingly, a key value pair of <C1+Ex, C> results and is written to MT. If a C1 does not exist in M for E′1, then that element E is discarded. Because sub-elements may have different combining class values, they may be combined in different orders and still be canonically equivalent. Consequently, for each C, there may be multiple valid E's, and thus multiple composition key value pairs for the same C within MT.
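The derivation of MT from M can be sketched as below, using a small hand-written excerpt of M. The names `canonical_sort` and `build_composition_table` are hypothetical, and the reverse lookup of E′1 through its canonically sorted form is an assumption of this sketch rather than something the text specifies.

```python
import unicodedata
from itertools import permutations

def canonical_sort(text):
    """Stable-sort each run of combining marks by combining class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# Illustrative excerpt of M (fully decomposed, canonically ordered).
M = {
    "\u00E2": "a\u0302",        # a with circumflex
    "\u1EA1": "a\u0323",        # a with dot below
    "\u1EAD": "a\u0323\u0302",  # a with circumflex and dot below
    "\u00FC": "u\u0308",        # u with diaeresis
    "\u01D6": "u\u0308\u0304",  # u with diaeresis and macron
}

def build_composition_table(m):
    """Derive MT: two-element sequence E1+E2 -> composed character C."""
    reverse = {d: c for c, d in m.items()}  # decomposition -> character
    mt = {}
    for c, d in m.items():
        if unicodedata.combining(d[0]) != 0:
            continue  # D must start with a base character
        base, marks = d[0], d[1:]
        for perm in set(permutations(marks)):
            e = base + "".join(perm)
            if canonical_sort(e) != d:
                continue  # not canonically equivalent to C: discard
            if len(e) == 2:
                mt[e] = c  # pair <E, C>
            elif len(e) > 2:
                head, last = e[:-1], e[-1]  # E'1 and right-most Ex
                c1 = reverse.get(canonical_sort(head))
                if c1 is not None:
                    mt[c1 + last] = c  # pair <C1+Ex, C>
    return mt

MT = build_composition_table(M)
```

In this excerpt U+1EAD receives two entries, one reached through U+1EA1 plus circumflex and one through U+00E2 plus dot below, while U+01D6 receives only one because its two marks share a combining class.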
The pre-processor then creates the NMT by determining whether or not there are ambiguous compositions in the MT. Typically, an ambiguous composition involves at least three characters: a base character B1, a combining character C1, and another combining character C2. When B1−C1+C2 and B1−C2+C1 are both valid compositions, the order in which C1 and C2 should be combined by the pre-processor with base character B1 is ambiguous. Since both forms are valid, one of the two forms is defined as the “normalized” composition. The NMT provides a mapping of B1−C1+C2 to B1−C2+C1, so that in all instances the normalized composition gets provided to the runtime processor when a composition is created. The pre-processor tests each key value pair, e.g., <E1+E2, C>, in the MT with each combining character, e.g., C′, having a combining class which is not equal to the combining class of E2. If E1−E2+C′ and E1−C′+E2 are both valid, the two are entered into the NMT together.
With the composition mapping tables MT and NMT created, an application, for example, may request a composition of a source decomposed character string, which is comprised of a plurality of characters, Cs. In this case, the runtime processor iterates through each C to build a “result string”. As the result string is being constructed, the runtime processor uses the mapping tables MT and NMT to determine whether Cs within the result string can be combined. If so, the runtime processor replaces those Cs with the valid composition from the MT or NMT. Specifically, if C is a base character, it is added to the result string R and its position (p) within the result string R is stored, wherein C can then be denoted by R(p). However, if C is a combining mark, and p is set, the runtime processor determines whether there exists a composition for R(p)+C in MT, denoted as CMT. If no other character within the result string after R(p) has the same combining class value as C, then R(p)+C is replaced with CMT. Alternatively, if p is set and if R(p)+C has a mapping in CNMT=B1+C1 in mapping table NMT, then the runtime processor scans the characters after R(p). If none of the characters have the same combining class value as C or C1, then R(p) is replaced with B1 and C2 is appended to the end of result string R. If p is not set or there is no CMT for R(p)+C, then the combining mark C is appended to the result string and the string is sorted using the Unicode combining class rules. In the case where C is a composite combining mark, C is decomposed into its characters and each character is then analyzed in the same manner as the original characters C get analyzed. Otherwise, C is simply appended to the result string and the runtime processor continues iterating through the remaining Cs.
As this process is repeated, Cs are combined in accordance with the entries in the mapping table MT and NMT to produce a CCS which is composed, to the maximum extent possible, and is still the canonical equivalent of the original source decomposed character string. The runtime processor returns the CCS to the application, clears the result string, and unsets p.
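A simplified sketch of the runtime composition loop is shown below. It is an approximation under stated assumptions: it consults only a hand-written excerpt of MT, omits the NMT lookup and the handling of composite combining marks, and the name `compose` is hypothetical.

```python
import unicodedata

# Illustrative excerpt of a composition mapping table MT
# (two-element sequence -> composed character).
MT = {
    "a\u0302": "\u00E2",
    "a\u0323": "\u1EA1",
    "\u1EA1\u0302": "\u1EAD",
    "\u00E2\u0323": "\u1EAD",
    "u\u0308": "\u00FC",
    "\u00FC\u0304": "\u01D6",
}

def compose(source, mt):
    """Compose a decomposed string using MT only (no NMT in this sketch)."""
    result = []  # result string R
    p = None     # position of the most recent base character, if set
    for c in source:
        cc = unicodedata.combining(c)
        if cc == 0:
            result.append(c)  # base character: append and remember p
            p = len(result) - 1
            continue
        composed = mt.get(result[p] + c) if p is not None else None
        # A mark after R(p) with the same combining class blocks C.
        blocked = p is not None and any(
            unicodedata.combining(x) == cc for x in result[p + 1:]
        )
        if composed is not None and not blocked:
            result[p] = composed  # replace R(p)+C with its composition
        else:
            result.append(c)      # keep the mark and re-sort the tail
            start = 0 if p is None else p + 1
            result[start:] = sorted(result[start:],
                                    key=unicodedata.combining)
    return "".join(result)

# Both orderings of the two marks compose to the same character U+1EAD.
first = compose("a\u0323\u0302", MT)
second = compose("a\u0302\u0323", MT)
```

With this excerpt, a source such as U+0075, U+0304, U+0308 stays uncomposed: MT has no entry for the first pair, and the pending macron then blocks the diaeresis because both marks share a combining class.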
According to one aspect of the invention, in a computer system capable of storing and processing data and having access to a Unicode database comprising predefined Unicode characters and predefined Unicode rules for decomposition of Unicode combined character sequences, a method for generating a canonical equivalent Unicode composition or decomposition from a Unicode source combined character sequence string upon request, comprising the steps of: A) reading a mapping table database from a plurality of Unicode source combined character sequence strings; B) receiving from a requesting entity a request for either of a Unicode composition or decomposition given one of the source combined character sequence strings as part of the request transmission; C) retrieving from the mapping table database the requested composition or decomposition based on the source combined character sequence string provided with the request; and D) providing the located and requested composition or decomposition to the requesting entity.
According to a second aspect of the invention, an apparatus for deriving canonical equivalent Unicode compositions from Unicode source combined character sequences comprises a pre-processor capable of generating i) a Unicode canonical equivalent composition or decomposition from a Unicode source combined character sequence, and ii) data defining a logical association between the source combined character sequence and the canonical equivalent composition or decomposition; and a map table generator coupled to the pre-processor and capable of storing the canonical equivalent Unicode compositions and decompositions received from the pre-processor, the Unicode source combined character sequence, and data defining a logical association between the Unicode source combined character sequence and the canonical equivalent Unicode compositions and decompositions derived therefrom.
According to a third aspect of the invention, a computer program product for use with a computer system having access to a Unicode database of predefined Unicode characters and predefined Unicode rules for decomposition of Unicode combined character sequences, the computer program product comprising a computer useable medium having program code embodied in the medium and configured to produce the canonical equivalent Unicode compositions from a Unicode source combined character sequence string, the program code comprising pre-processor program code capable of generating i) a Unicode canonical equivalent composition or decompositions from a Unicode source combined character sequence and ii) data defining a logical association between the source combined character sequence and the canonical equivalent composition and decompositions; and map table generator program code responsive to the pre-processor program code and capable of storing the canonical equivalent Unicode compositions and decompositions received from the pre-processor program code, the Unicode source combined character sequence, and data defining a logical association between the Unicode source combined character sequence and the canonical equivalent Unicode composition and decompositions derived therefrom.