1. Field of the Invention
Embodiments of the invention described herein pertain to the field of computer systems. More particularly, but not by way of limitation, one or more embodiments of the invention enable a system and method for performing Unicode matching for comparing and merging similar data objects having Unicode strings that are equivalent yet not exact matches.
2. Description of the Related Art
Data objects are database entities that represent objects such as products for example. Data objects may be constructed in a database with number fields, string fields and other field types associated with different types of data such as binary large objects or images for example.
Duplicate data objects in databases occur when two or more data objects exist in a database that actually represent the same object. These duplicate data objects have similar, yet slightly different values in one or more fields that make up the database object. Duplicate data objects are created for example via incorrect data entry or merging of systems that contain slightly different versions of data objects. One such scenario occurs when data objects are entered into a database with string fields that have typographic errors, abbreviations, omissions or transpositions for example. Consolidating duplicate data objects preserves data integrity and minimizes costs associated with maintaining duplicate data objects.
Database object string fields hold characters that represent words in a desired language, for example English. English characters may be encoded using the American Standard Code for Information Interchange (ASCII). Checking words for near matches in ASCII encoded strings is relatively easy since the problem domain is so small, i.e., there are only 128 characters and words are built character by character. In other languages where a single data value represents an entire word, there is no previously known method for determining how “close” one word is to another. This is true since the encoding for the word does not include any of the characteristics of the word such as sound, number of strokes, radicals, geometry or any other characteristic that can be utilized to determine how closely related one word is to another. One such measure of how close one word is to another relates to how “far apart” the two words are in an input method editor graphical user interface, whereby a user may erroneously select one word instead of another, e.g., be “off” by one list entry when selecting a given word.
Traditional Chinese, for example, includes over 40,000 logograms which represent words. Chinese and other such languages therefore cannot be encoded in as small a range of values as alphabet based languages: an ASCII character readily fits in a single 8 bit word, while traditional Chinese requires at least two 8 bit words. Furthermore, Chinese characters in the Basic Multilingual Plane (BMP) encoded in UTF-8 require up to three 8 bit words in binary computer memory. Japanese is another logogram based language. These types of logogram based languages are generally encoded in “Unicode” for storage of text in databases.
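The storage sizes mentioned above can be checked with a short sketch. The helper name `encoded_lengths` and the sample characters are illustrative choices, not part of the described system.

```python
# Compare how many 8 bit units one character needs in two common encodings.
def encoded_lengths(ch: str) -> dict:
    """Return the byte length of a single character in UTF-8 and UTF-16."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # big-endian, no byte-order mark
    }

print(encoded_lengths("e"))   # {'utf-8': 1, 'utf-16': 2}
print(encoded_lengths("中"))  # {'utf-8': 3, 'utf-16': 2}
```

The Latin letter occupies one byte in UTF-8, while the BMP ideograph occupies three, which matches the "up to three 8 bit words" figure given above.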
Unicode is an industry standard for representing text that enables consistent representation of text regardless of language. Symbols in Unicode are assigned unique “code points”. Code points may be represented as binary or hexadecimal values for example. An example code point is written as “U+xxxx” where “xxxx” represents a hexadecimal number associated with the code point, e.g., “U+0065” which represents the letter “e”. Encoding a language such as traditional Chinese requires a much larger range of values, or code points, when compared to ASCII for example. When checking a particular Unicode code point to determine if it really should be a different word, there is no previously known method to utilize related characteristics associated with the word to determine how close two words are to one another or whether an input error may have occurred for example.
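As a small illustration of the “U+xxxx” notation, the following sketch converts between characters and code point labels using Python built-ins. The function name `code_point_label` is a hypothetical helper introduced here only for the example.

```python
def code_point_label(ch: str) -> str:
    """Format a single character as a 'U+xxxx' code point label."""
    return f"U+{ord(ch):04X}"

print(code_point_label("e"))   # U+0065
print(chr(0x0065))             # e
print(code_point_label("中"))  # U+4E2D
```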
Japanese is another example of a logographic language. Japanese is written using three types of symbols. Kanji symbols include ideographic and pictographic characters adopted from the Chinese language that fit into fewer than 20 geometric structures. Conceptual words in Japanese such as verbs, adjectives and names for example are generally written using kanji. Kana symbols are phonetic symbols developed in Japan. Each kana symbol is a phonetic representation of a syllable. Kana is written in one of two ways depending on the type of word it represents, namely hiragana and katakana. Hiragana symbols are utilized in writing native words not written in kanji and inflectional endings of kanji words. Katakana is utilized in writing foreign words. In addition, Romanization of Japanese words is accomplished using 22 roman characters and 2 diacritical marks. Homophones, i.e., words that sound alike but have different meanings, may be represented with different kanji. There are a large number of homophones in Japanese and hence Romanized Japanese is at times difficult to understand even in context. For verbal input methods, homophones present a very real possibility for erroneous data entry. There is no concept of capital versus lowercase letters in Japanese, unlike English. Hence normalization of case in Japanese (for example to all lowercase) before comparison is not possible and is not needed for comparison purposes. There are two types of Romanization utilized in writing Japanese, Romaji and Hepburn, that differ slightly from one another. Although Japanese kanji officially number about 2,000 characters, these characters may be mixed with phonetic symbols to make heterographs, i.e., words that are spelled differently but sound the same and mean the same thing. Erroneous homophone data entry and correct heterograph entry yield data values that may not yield exact spelling matches. Non-exact spelling matches in fields that should be the same signify potential data object merging problems.
Input method editors (IME's) are utilized in entering complex languages into a computer system. Japanese may be entered into a computer in many different ways, including the use of an IME. Use of a Japanese input method editor (IME) on a computer system allows for the selection of characters phonetically, via hiragana and katakana and through use of radicals for example. In addition, Romanized typing of Japanese words on a keyboard or IME is another entry method. In this type of entry, the computer guesses the correct symbol based on the Romanized input and underlines the entry as tentative. Some IME's allow for the entry of a SPACE character to yield a list from which to pick related symbols. Symbols near the correct entry (above or below in the list) may occasionally be accidentally chosen for example. Characters that are close to one another on an IME (next to or above or below the correct symbol in a table) are potential erroneous entry values as a user entering text may select a character near the correct symbol. There are no prior known systems that decompose Unicode code points into related code points based on the type of IME used for data entry.
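The “off by one list entry” error described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `CANDIDATES` table, its ordering, and the `neighbors` helper are hypothetical and do not reflect any particular IME. The three kanji shown are common homophones read “hashi”.

```python
# Hypothetical candidate list an IME might display for the reading "hashi".
# The ordering is assumed; real IMEs rank candidates in their own ways.
CANDIDATES = {"hashi": ["橋", "箸", "端"]}

def neighbors(reading: str, chosen: str, radius: int = 1) -> list:
    """Return candidates within `radius` list positions of the chosen symbol."""
    entries = CANDIDATES.get(reading, [])
    if chosen not in entries:
        return []
    i = entries.index(chosen)
    lo = max(0, i - radius)
    return [c for c in entries[lo:i + radius + 1] if c != chosen]

print(neighbors("hashi", "箸"))  # ['橋', '端']
print(neighbors("hashi", "橋"))  # ['箸']
```

Under these assumptions, a stored value of 箸 would have 橋 and 端 as plausible intended values, since either lies one list entry away.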
Chinese may also be entered into computer systems via IME's by breaking up the logograms via structure such as with the Cangjie or Wubi method of typing Chinese, or using phonetic systems such as Zhuyin or Pinyin and selecting possible choices from lists. Since the data entry may result in a closely related erroneous selection, duplicate data objects may result. For example, erroneous selection from a list may result in the entry of a selection that is one logogram away from the desired one. This problem is not unique to Japanese and Chinese and applies to any language having a large number of Unicode code points.
Similar issues exist in the entry of other languages such as Korean and the nearly extinct script version of Vietnamese.
When comparing data objects, for example two strings encoded in ASCII, character by character, one word may contain a character that is not in a second word, or the second word may not be found in a dictionary for example. If the rest of the data in each data object compares favorably, then the two objects may actually represent the same data object and hence, may be consolidated. Comparing objects in ASCII is relatively easy since the domain is small (128 characters) and since words constructed in this domain are readily comparable letter by letter.
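One common way to quantify a letter by letter comparison of this kind is the Levenshtein edit distance. The source does not name a specific algorithm, so the sketch below is a swapped-in illustration rather than the method of any described system; `edit_distance` is a hypothetical helper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

print(edit_distance("color", "colour"))   # 1
print(edit_distance("widget", "widgte"))  # 2
```

A small distance between two string fields, combined with favorable comparisons on the remaining fields, would suggest that two data objects are candidates for consolidation.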
Comparing strings in Unicode written in languages with small character sets involves checking for accent marks over characters. For example, comparing strings that include letters having accent marks is performed by transforming single accented characters into corresponding combining sequences. This process is defined as “Unicode normalization”. For example, comparing U+00E9 (a Latin small letter e with an acute accent mark) involves breaking the letter into two code points, namely U+0065 and U+0301, i.e., “e” and the acute accent combining character, so that “e” can be compared against the accented version. Unicode normalization is described in Unicode Standard Annex #15 (UAX #15). Another term for breaking up characters with diacritics is “decomposition”.
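The decomposition of U+00E9 described above can be reproduced with Python's standard `unicodedata` module. The helper names `decompose` and `strip_marks` are illustrative; stripping the combining marks afterwards is one possible way to compare against the unaccented letter, shown here as an assumption rather than a prescribed step.

```python
import unicodedata

def decompose(text: str) -> list:
    """Return the code points of the canonical decomposition (NFD) of text."""
    nfd = unicodedata.normalize("NFD", text)
    return [f"U+{ord(c):04X}" for c in nfd]

def strip_marks(text: str) -> str:
    """Decompose, then drop combining marks so base letters can be compared."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if not unicodedata.combining(c))

print(decompose("\u00e9"))                 # ['U+0065', 'U+0301']
print(strip_marks("caf\u00e9") == "cafe")  # True
```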
Currently, systems attempting to consolidate data objects that actually represent the same object perform string comparisons with Unicode strings and only find exact matches. Hence only exact copies of data objects can currently be consolidated via existing methods. Current systems are therefore highly inefficient at comparing similar Unicode strings that are not exact matches, since the domain is so large. There are no known systems that compare closely related data objects in Unicode, e.g., for large character sets such as Japanese or traditional Chinese, for consolidation.
There are no known comparison systems that decompose logograms or Unicode representations thereof based on the input method used to enter the Unicode string. For at least the limitations described above there is a need for a system and method for performing Unicode matching for comparing and merging similar data objects having Unicode strings that are equivalent yet not exact matches.