1. Field of the Invention
This invention relates to the arts of text encoding for multiple languages and scripts, and especially to the arts of determining the equivalence of text strings encoded using different sets of code points.
2. Description of the Related Art
Until recently, the World Wide Web (WWW) has been viewed as mainly a display-oriented technology. As the Web matures, a gradual transition is occurring from display-based protocols to pure content-based protocols. These new content-based protocols make full use of Unicode for representing data. This use of Unicode, however, has not made the transition easy.
Historically, the primary purpose of the Web has been to render and display “unidirectional” (left-to-right or right-to-left) data which is sent from a server computer to a browser computer. Initially, this data was based upon traditional single and double byte character encodings.
User expectations and demands, however, have risen. The promise of global e-business has required the Web to adopt richer and more expressive encodings. This need is being addressed by both Unicode and ISO10646, both of which are well known within the art. FIG. 1 presents the well-known Unicode character-control model, including an application layer (10), a control layer (11), a character layer (12), a codepoint layer (13), and a transmission layer (14).
A primary need for metadata in Unicode occurs in the control layer (11), as one may anticipate. In FIG. 1, a dotted line is used to separate the character layer (12) from the control layer (11) to illustrate that the boundary separating characters from control is sometimes difficult to define. This inability to provide a clean separation has made applications (10) that are based on Unicode more difficult to implement.
Unicode's ability to represent all the major written scripts of the world makes it an ideal candidate for a universal character encoding. Additionally, conversion into and out of Unicode is easy, due to Unicode being a superset of all significant character encodings. In many cases, this has resulted in multiple ways of encoding a given character. These “design points” or “code points” have made the shift to Unicode both rational and straightforward.
Using Unicode primarily as a means for displaying text has worked extremely well thus far. Furthermore, font creators have employed Unicode as a glyph indexing scheme, fostering Unicode's use for display. Moreover, Unicode's ability to combine characters to form composite glyphs has made it possible to render scripts that until now have been given little attention.
The Web, however, is transitioning from simple unidirectional display to full bidirectional interchange. This is evidenced by the move to Extensible Markup Language (XML) and other content based protocols.
In these content-based protocols, the rendering of data is of secondary importance. Application programs that rely on such protocols, such as speech and search engines, place greater importance on the semantic properties of the data rather than its visual appearance or display value.
This transition, nonetheless, presents significant problems for data interchange, since some characters can be represented in multiple ways using different combinations of code points. Unicode's ability to represent semantically equivalent characters in numerous ways has placed a heavy burden on application program and web site content developers. Determining whether two sequences of code points represent equivalent characters or strings has become difficult.
Unicode has attempted to solve this problem by defining a “normalization” method for determining whether two sequences of characters are equivalent. This solution, however, has been met with great resistance by industry, largely due to the complexity of the method and its lack of flexible versioning.
The World Wide Web Consortium (W3C) has also offered a solution to this problem. The W3C provides a method for constructing a normalized form of Unicode. This solution also suffers from problems, due to its inability to fully normalize all of Unicode.
Unicode Characters
Instead of encoding just the commonly occurring “accented” characters, known in Unicode as “precomposed” characters, Unicode permits dynamic composition of characters. This enables the construction of new characters by integrating diacritical marks with characters.
A dynamically composed character is fabricated from a base character followed by one or more combining diacritic marks, rendered using either a single composite glyph or a combination of individual glyphs. For example, FIG. 5 shows the two possible ways of representing the latin capital letter “e” with acute. Line 51 shows the character in its precomposed form, while line 52 shows the character in its decomposed form.
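The two representations can be compared with a minimal Python sketch. It assumes the precomposed form is U+00C9 and the decomposed form is U+0045 followed by U+0301 (the code points shown in FIG. 5 are not reproduced in this text), and it uses the standard `unicodedata` module only as a convenient illustration.

```python
import unicodedata

# Assumed code points for the two forms of capital "E" with acute.
precomposed = "\u00C9"        # LATIN CAPITAL LETTER E WITH ACUTE
decomposed = "\u0045\u0301"   # LATIN CAPITAL LETTER E + COMBINING ACUTE ACCENT


def code_points(text):
    """Return the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The two strings differ code point by code point ...
same_code_points = precomposed == decomposed

# ... yet each converts into the other under canonical (de)composition.
decomposes_to_pair = unicodedata.normalize("NFD", precomposed) == decomposed
composes_to_single = unicodedata.normalize("NFC", decomposed) == precomposed

print(code_points(precomposed), code_points(decomposed))
```

A plain comparison of the two strings reports them as different, which is the interchange problem discussed in this section.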
Unicode places no limitation on the number of combining characters that may be incorporated with a base character. In certain cases, multiple diacritical marks may interact typographically. For example, consider line 61 on FIG. 6 where the latin small letter “a” is followed by the combining “ring above” and combining “acute”. When this occurs, the order of graphic display is strictly determined by the order in which the characters occur in the code points. The diacritics are rendered from the base character's glyph outward. Combining characters that are to be rendered above the base character are stacked vertically, beginning with the first combining character and ending with the last combining character that is rendered above the base character. The rendering of this sequence is shown on line 62 of FIG. 6. The situation is reversed for combining characters that are rendered below a base character.
When a base character is followed by a sequence of diacritic characters that do not typographically interact, Unicode permits some flexibility in the ordering of the diacritical marks. For example, line 71 on FIG. 7 shows latin small letter “c” followed by combining “cedilla” and combining “circumflex”. Equivalently, the order of the “cedilla” and the “circumflex” could be reversed, as shown in line 72 of FIG. 7. In either case, the resultant rendering is the same as shown in line 73.
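The effect of mark ordering can be sketched in Python as follows. The code points are assumptions (U+0327 for the combining cedilla, U+0302 for the combining circumflex, U+030A for the combining ring above, U+0301 for the combining acute), and the helper name `canonically_equivalent` is hypothetical.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Hypothetical helper: compare two strings after canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Non-interacting marks: the cedilla sits below and the circumflex above.
cedilla_first = "c\u0327\u0302"
circumflex_first = "c\u0302\u0327"

# Interacting marks: ring above and acute are both rendered above the base.
ring_then_acute = "a\u030A\u0301"
acute_then_ring = "a\u0301\u030A"

non_interacting_equal = canonically_equivalent(cedilla_first, circumflex_first)
interacting_equal = canonically_equivalent(ring_then_acute, acute_then_ring)
```

In this sketch the two orderings of cedilla and circumflex compare equal, while the two orderings of ring and acute do not, because their relative order affects the rendering.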
The Unicode standard avoids the encoding of duplicate characters by unifying them within scripts across languages; characters that are equivalent in form are only encoded once. There are, however, characters that are encoded in Unicode that would not have normally been included, because they are variants of existing characters. These characters have been included for purposes of round-trip convertibility with legacy encodings. The prime examples are Arabic contextual forms, ligatures, and other glyph variants.
Additionally, the compatibility area also includes characters that are not variants of characters already encoded, such as Roman numerals. Some more examples of compatibility characters are given in Table 1, in which the class column identifies the type of the compatibility character, the Unicode Character(s) column lists an alternate Unicode character sequence, and the Compatible Character column contains the corresponding compatible character.
TABLE 1. Compatibility characters

| Class | Compatible Character(s) | Unicode Character |
|---|---|---|
| Glyph Variants | U004C, U00B7 (L·) | U013F (Ŀ) |
| Fullwidth Forms | U0041 (A) | UFF21 (Ａ) |
| Vertical Forms | U2014 (em dash) | UFE31 (vertical em dash) |
Unicode cautions against using compatibility characters in character streams, as the replacement of compatibility characters by other characters is irreversible. Additionally, there is a potential loss of semantics when a compatibility character is replaced by another character. For example, when “Ŀ”, the latin capital letter “L” with middle dot (U013F), is replaced with the “L” and “·” characters (U004C and U00B7), the knowledge that the middle dot should be inside the “L” is lost.
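The irreversibility can be illustrated with a short Python sketch that uses the standard `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

l_with_middle_dot = "\u013F"  # LATIN CAPITAL LETTER L WITH MIDDLE DOT

# The character database records a compatibility mapping for U+013F.
mapping = unicodedata.decomposition(l_with_middle_dot)

# Replacing the compatibility character yields "L" followed by MIDDLE DOT.
replaced = unicodedata.normalize("NFKD", l_with_middle_dot)

# Recomposing does not restore the original character: composition only
# applies canonical mappings, so the two-character sequence is kept.
restored = unicodedata.normalize("NFKC", replaced)

round_trips = restored == l_with_middle_dot
```

Here `replaced` is the sequence U+004C, U+00B7 and `round_trips` is false, so the information that the dot belonged inside the letter cannot be recovered from the replaced text alone.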
Unicode Normalization
Normalization is the general process used to determine when two or more sequences of characters are equivalent. In this context, the term “equivalent” is ambiguous, because it can be interpreted in multiple ways. For example, it could mean that characters are equivalent when their code points are identical, or when they have indistinguishable visual renderings, or when they represent the same content.
Unicode supports two broad types of character equivalence, canonical and compatibility. In canonical equivalence, the term “equivalent” means character sequences that exhibit the same visual rendering. For example, the character sequences on lines 71 and 72 of FIG. 7 both produce identical visual renderings, shown on line 73.
In compatibility equivalence, the term “equivalent” is taken to mean characters representing the same content. For example, line 81 of FIG. 8 shows the single “fi” ligature, while line 82 of FIG. 8 shows the compatible two character sequence “f” and “i”. In this case, both sequences of characters represent the same semantic content. The only difference between the two sequences is whether or not a ligature is used during rendering.
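The difference between the two kinds of equivalence can be sketched in Python. The sketch assumes the ligature of FIG. 8 is U+FB01 (LATIN SMALL LIGATURE FI), which is not stated in this text.

```python
import unicodedata

ligature = "\uFB01"   # assumed code point for the single "fi" ligature
two_chars = "fi"      # U+0066, U+0069

# Canonical decomposition leaves the ligature untouched, so the two
# sequences are not canonically equivalent.
canonical_match = (
    unicodedata.normalize("NFD", ligature)
    == unicodedata.normalize("NFD", two_chars)
)

# Compatibility decomposition maps the ligature to "f" + "i", so the two
# sequences are compatibility equivalent.
compatibility_match = (
    unicodedata.normalize("NFKD", ligature)
    == unicodedata.normalize("NFKD", two_chars)
)
```

Under these assumptions `canonical_match` is false and `compatibility_match` is true.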
Unicode defines four specific forms of normalization based upon the general canonical and compatibility equivalencies, summarized in Table 2 in which the title column indicates the name of the normal form, and the category column indicates the equivalence type.
TABLE 2. Normalization forms

| Title | Category | Description |
|---|---|---|
| Normalization Form D (NFD) | Canonical | Canonical Decomposition |
| Normalization Form C (NFC) | Canonical | Canonical Decomposition followed by Canonical Composition |
| Normalization Form KD (NFKD) | Compatibility | Compatibility Decomposition |
| Normalization Form KC (NFKC) | Compatibility | Compatibility Decomposition followed by Canonical Composition |
Normalization form D (NFD) substitutes precomposed characters with their equivalent canonical sequence. Characters that are not precomposed are left as-is. Diacritics (combining characters), however, are subject to potential reordering. This reordering only occurs when sequences of diacritics that do not interact typographically are encountered; those that do interact are left alone.
In Unicode, each character is assigned to a combining class. Non-combining characters are assigned the value zero, while combining characters are assigned a positive integer value. The reordering of combining characters operates according to the following three steps:
    1. Look up the combining class for each character.
    2. For each pair of adjacent characters AB, if the combining class of B is not zero and the combining class of A is greater than the combining class of B, swap the characters.
    3. Repeat step 2 until no more exchanges can be made.
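The three reordering steps can be sketched in Python as follows. The function name `canonical_reorder` is hypothetical, and the combining classes are read through `unicodedata.combining` from the Python standard library; this is an illustration of the steps as described, not a complete normalizer, since it performs no decomposition.

```python
import unicodedata


def canonical_reorder(text):
    """Reorder combining marks by repeatedly swapping adjacent pairs."""
    chars = list(text)
    swapped = True
    while swapped:                      # step 3: repeat until no swap occurs
        swapped = False
        for i in range(len(chars) - 1):
            # Step 1: look up the combining class of each character.
            class_a = unicodedata.combining(chars[i])
            class_b = unicodedata.combining(chars[i + 1])
            # Step 2: swap when B is a combining mark with a lower class than A.
            if class_b != 0 and class_a > class_b:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return "".join(chars)


# Circumflex (class 230) before cedilla (class 202) is reordered.
reordered = canonical_reorder("c\u0302\u0327")

# Ring above and acute share class 230, so their order is preserved.
unchanged = canonical_reorder("a\u030A\u0301")
```

Because marks with equal combining classes are never swapped, sequences of typographically interacting diacritics keep their original order.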
After all of the precomposed characters are replaced by their canonical equivalents and all non-interacting combining characters have been reordered, the sequence is then said to be in NFD.
Normalization form C (NFC) uses precomposed characters where possible, maintaining the distinction between characters that are compatibility equivalents. Most sequences of Unicode characters are already in NFC. To convert a sequence of characters into NFC, the sequence is first placed into NFD. Each character is then examined to see if it should be replaced by a precomposed character. If the character can be combined with the last character whose combining class was zero, then the sequence is replaced with the appropriate precomposed character. After all of the diacritics that can be combined with base characters are replaced by precomposed characters, the sequence is said to be in NFC.
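The conversion into NFC can be observed with the Python standard library, as in the sketch below. It relies on `unicodedata.normalize` rather than implementing composition itself, and the sample strings are illustrative assumptions.

```python
import unicodedata

# "e" + COMBINING ACUTE ACCENT composes into the precomposed character.
composed_e = unicodedata.normalize("NFC", "e\u0301")

# "c" + circumflex + cedilla: the marks are first reordered (as in NFD),
# the cedilla then combines with "c", and the circumflex remains as a
# separate combining character because no precomposed form includes it.
partly_composed = unicodedata.normalize("NFC", "c\u0302\u0327")

# A compatibility character such as the "fi" ligature is left untouched,
# since NFC maintains the distinction between compatibility equivalents.
ligature_kept = unicodedata.normalize("NFC", "\uFB01")
```

The second case shows that NFC output may still contain combining characters when no suitable precomposed character exists.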
Normalization form KD (NFKD) replaces precomposed characters by sequences of combining characters and also replaces those characters that have compatibility mappings. In this normal form, formatting distinctions may be lost. Additionally, the ability to perform round trip conversion with legacy character encodings may become problematic because of the loss of formatting. NFKC replaces sequences of combining characters with their precomposed forms while also replacing characters that have compatibility mappings.
Further, there are some characters encoded in Unicode that possibly need to be ignored during normalization, in particular the bidirectional controls, the zero width joiner, and the zero width non-joiner. These characters are used as format effectors. Specifically, the joiners can be used to promote or inhibit the formation of ligatures. Unfortunately, Unicode does not provide definitive guidance as to when these characters can be safely ignored in normalization. Unicode only states that these characters should be filtered out before storing or comparing programming language identifiers; it gives no other guidance.
To assist in the construction of the normal forms, Unicode maintains a data file listing each Unicode character along with any equivalent canonical or compatible mappings. Processes that wish to perform normalization must use this data file. By having all normalization algorithms rely on this data, the normal forms are guaranteed to remain stable over time. If this were not the case, it would be necessary to communicate the version of the normalization algorithm along with the resultant normal form.
The best way to illustrate the use of normal forms is through an example. Consider the general problem of searching for a string. In particular, assume that a text process is searching for the string “flambé”. At first this seems trivial; however, Table 3 lists just some of the possible ways in which the string “flambé” could be represented in Unicode.
TABLE 3. The string “flambé”

| # | Code points | Description |
|---|---|---|
| 1 | U0066, U006C, U0061, U006D, U0062, U0065, U0301 | decomposed |
| 2 | U0066, U006C, U0061, U006D, U0062, U00E9 | precomposed |
| 3 | UFB02, U0061, U006D, U0062, U00E9 | fl ligature, precomposed |
| 4 | UFB02, U0061, U006D, U0062, U0065, U0301 | fl ligature, decomposed |
| 5 | UFF46, UFF4C, UFF41, UFF4D, UFF42, U00E9 | full width, precomposed |
| 6 | UFB02, UFF41, UFF4D, UFF42, U00E9 | fl ligature, full width, precomposed |
| 7 | U0066, U200C, U006C, U0061, U006D, U0062, U00E9 | ligature suppression, precomposed |
| 8 | U0066, U200C, U006C, U0061, U006D, U0062, U0065, U0301 | ligature suppression, decomposed |
| 9 | U0066, U200D, U006C, U0061, U006D, U0062, U00E9 | ligature promotion, precomposed |
| 10 | U0066, U200D, U006C, U0061, U006D, U0062, U0065, U0301 | ligature promotion, decomposed |
| 11 | U202A, U0066, U006C, U0061, U006D, U0062, U00E9, U202C | left to right segment, precomposed |
| 12 | UFF46, U200C, UFF4C, UFF41, UFF4D, UFF42, U00E9 | full width, ligature suppression, precomposed |
| 13 | UFF46, U200D, UFF4C, UFF41, UFF4D, UFF42, U00E9 | full width, ligature promotion, precomposed |
The character sequences found in Table 3 are all equivalent when transformed into either NFKC or NFKD. In the case of NFKD, all transformations yield the sequence found on row #1 of Table 3, while transformations into NFKC result in the sequence on row #2 of Table 3.
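A Python sketch of this comparison follows. It assumes that each row of Table 3 spells “flambé” with the code points written out in the code, and it makes one further assumption that should be noted: the `unicodedata` module of the Python standard library leaves the joiners and bidirectional controls in place during normalization, so the hypothetical helpers `fold_nfkd` and `fold_nfkc` remove an assumed set of such format controls before normalizing.

```python
import unicodedata

# Assumed set of format controls to ignore: joiners and bidi embeddings.
FORMAT_CONTROLS = {
    "\u200C", "\u200D",                                # ZWNJ, ZWJ
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",  # bidi controls
}

ROWS = [
    "flambe\u0301",                                # 1 decomposed
    "flamb\u00E9",                                 # 2 precomposed
    "\uFB02amb\u00E9",                             # 3 fl ligature
    "\uFB02ambe\u0301",                            # 4 fl ligature, decomposed
    "\uFF46\uFF4C\uFF41\uFF4D\uFF42\u00E9",        # 5 full width
    "\uFB02\uFF41\uFF4D\uFF42\u00E9",              # 6 ligature, full width
    "f\u200Clamb\u00E9",                           # 7 ligature suppression
    "f\u200Clambe\u0301",                          # 8
    "f\u200Dlamb\u00E9",                           # 9 ligature promotion
    "f\u200Dlambe\u0301",                          # 10
    "\u202Aflamb\u00E9\u202C",                     # 11 left to right segment
    "\uFF46\u200C\uFF4C\uFF41\uFF4D\uFF42\u00E9",  # 12
    "\uFF46\u200D\uFF4C\uFF41\uFF4D\uFF42\u00E9",  # 13
]


def strip_format_controls(text):
    """Drop the assumed ignorable format controls from a string."""
    return "".join(ch for ch in text if ch not in FORMAT_CONTROLS)


def fold_nfkd(text):
    return unicodedata.normalize("NFKD", strip_format_controls(text))


def fold_nfkc(text):
    return unicodedata.normalize("NFKC", strip_format_controls(text))


all_match_row_1 = all(fold_nfkd(row) == ROWS[0] for row in ROWS)
all_match_row_2 = all(fold_nfkc(row) == ROWS[1] for row in ROWS)
```

Without the stripping step, rows 7 through 13 would retain their format controls in this library and would not compare equal to rows 1 and 2, which illustrates the lack of definitive guidance on when such characters may be ignored.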
To further demonstrate this, consider the conversion of line 91 of FIG. 9, copied from row #6 of Table 3, into NFKD. First, the sequence is converted to NFD by replacing precomposed characters with their decomposed equivalents, with the result as shown on line 92 of FIG. 9. Second, all characters that have compatibility mappings are then replaced by their corresponding compatibility characters, resulting in the code points shown in line 93 of FIG. 9. The final sequence obtained is the same as the one found on row #1 of Table 3.
The fact that all of the sequences found in Table 3 are equivalent when represented in NFKD is not necessarily true when the sequences are represented in other normal forms. For example, consider line 101 of FIG. 10 copied from row #3 in Table 3. When this sequence is converted to NFD, the result is line 102 of FIG. 10. This does not match the sequence on row #1 of Table 3, hence they are not equivalent.
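The mismatch under NFD can be reproduced with a short Python sketch, assuming row #3 consists of the fl ligature U+FB02 followed by “amb” and the precomposed U+00E9, and row #1 is the fully decomposed string.

```python
import unicodedata

row_1 = "flambe\u0301"       # decomposed "flambé"
row_3 = "\uFB02amb\u00E9"    # fl ligature, precomposed

# NFD decomposes the accented letter but keeps the ligature, because the
# ligature has only a compatibility mapping.
row_3_nfd = unicodedata.normalize("NFD", row_3)

# NFKD additionally replaces the ligature with "f" + "l".
row_3_nfkd = unicodedata.normalize("NFKD", row_3)

matches_under_nfd = row_3_nfd == row_1
matches_under_nfkd = row_3_nfkd == row_1
```

In this sketch `matches_under_nfd` is false and `matches_under_nfkd` is true, so the outcome of a string comparison depends on which normal form is chosen.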
This situation presents problems for general searching and pattern matching. Without a single normalization form, it is not possible to determine reliably whether two strings are identical or “equivalent”. The W3C has advocated adopting Unicode's NFC for use on the web. Additionally, W3C recommends that normalization be performed early (by the sender) rather than late (by the recipient). Their recommendation is to be conservative in what you send, while being liberal in what you accept. The major arguments for taking this approach are:
    (1) almost all data on the web is already in NFC;
    (2) most receiving components assume early normalization; and
    (3) not all components that perform string matching can be expected to do normalization.
There are some problems with this strategy, however. It assumes that the primary purpose of normalization is to determine whether two sequences have identical renderings. In Unicode's NFC, any and all formatting information is retained. This causes problems for those processes that require comparisons to be based only upon raw content, specifically web search engines and web based databases. Additionally, it places undue limitations on the characters that can be used during interchange.
Therefore there is a need in the art for a system and method for normalizing encoded text data such as Unicode data. This new system and method should be capable of normalizing all text within a given text encoding scheme or standard, while at the same time not being bound to any specific version of the encoding standard.