The present invention relates to name matching, and more specifically, to native-script and cross-script Chinese name matching. Chinese characters (known as Hanzi in Chinese, Kanji in Japanese, and Hanja in Korean) are used to represent names in several languages, each of which may use different characters for the same underlying name. Even within Chinese itself, there are regional variations. In mainland China and Singapore, for example, a simplified character set is used, while Taiwan and Hong Kong use traditional characters.
Before Unicode was widely adopted, different encoding systems were used for Chinese characters, and the range of characters supported by one encoding system was likely to be different from that of another encoding system. When an electronic text from one region was rendered into a version readable by people from another region, not only did the encoding system need to be converted, but region-specific characters also needed to be changed. For example, the name for the founding father of the People's Republic of China is represented as 毛泽东 in mainland China, as 毛澤東 in Taiwan, and as 毛沢東 in Japan.
The Unicode Consortium reserves a large range of code points to cover essentially all Chinese characters in use. This has many advantages, but it also creates some new challenges. One such challenge is that it is no longer obvious which regional variant is being used, since characters from different regions can appear in the same text as long as there is proper font support. The variant names mentioned above, 毛泽东, 毛澤東, 毛沢東 and even  may all exist in a single database of personal names. Given any one variant as a query name, the name matching technology must be able to match all the other variants.
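The matching requirement stated above can be illustrated by folding every character to one canonical representative before names are compared. The following is a minimal sketch, not a description of any existing system: the table `VARIANT_TO_CANONICAL`, the functions `fold_variants` and `find_variants`, and the sample records (including the mixed-form record) are hypothetical, and the table holds only three entries where a real one would hold thousands.

```python
# Hypothetical fragment of a variant table. Each regional form maps to one
# canonical character (here, arbitrarily, the simplified form).
VARIANT_TO_CANONICAL = {
    "澤": "泽",  # traditional form
    "沢": "泽",  # Japanese form
    "東": "东",  # traditional form, also used in Japan
}


def fold_variants(name: str) -> str:
    """Replace each character by its canonical form; unknown characters pass through."""
    return "".join(VARIANT_TO_CANONICAL.get(ch, ch) for ch in name)


def find_variants(query: str, database: list[str]) -> list[str]:
    """Return every database name whose folded form equals the folded query."""
    key = fold_variants(query)
    return [name for name in database if fold_variants(name) == key]


# Illustrative records: three regional forms, one mixed form, one unrelated name.
names = ["毛泽东", "毛澤東", "毛沢東", "毛澤东", "王东"]
matches = find_variants("毛沢東", names)
```

Under these assumptions, a query in any one form retrieves the other forms and leaves the unrelated name out. Character-by-character folding is a simplification, because the correspondence between simplified and traditional characters is not always one-to-one (the simplified 发, for instance, corresponds to both 發 and 髮).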
Existing name search systems do not have this capability. The Google search engine, one of the most popular search engines worldwide, lets the user specify traditional and simplified Chinese as two different language options, but it does not automatically convert a query in traditional Chinese characters to its simplified equivalent, or vice versa, when results are requested in the other language option. The Baidu search engine, one of the most popular search engines in China, does not have this capability either.
The problems described above are compounded by cross-script name matching. Various techniques have been proposed and implemented, particularly within cross-language information retrieval and machine translation, including transliteration, back transliteration, parallel name databases, and machine learning. However, such systems typically overlook the fact that a name in one script may have more than one representation in another script, either because the source name has several readings (e.g. Japanese Kanji names) or because more than one system exists for transliterating the source language into the target script (e.g. Pinyin, Wade-Giles and Yale for Romanizing Mandarin Chinese). Even when such transliteration standards exist, a person may choose a form that differs from any standard convention.
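The effect of competing transliteration systems can be made concrete with a small sketch. It assumes a hypothetical table, `ROMANIZATIONS`, that lists the spelling of three Mandarin syllables under Pinyin, Wade-Giles and Yale with tone marks and diacritics removed; the helper names `normalize` and `romanized_keys` are likewise illustrative. Mixing systems from one syllable to the next is allowed here as a rough stand-in for personal, non-standard spellings.

```python
from itertools import product

# Hypothetical fragment: toneless Pinyin syllable -> spelling in each system,
# with tone marks, apostrophes and diacritics removed.
ROMANIZATIONS = {
    "mao": {"pinyin": "mao", "wade_giles": "mao", "yale": "mau"},
    "ze": {"pinyin": "ze", "wade_giles": "tse", "yale": "dze"},
    "dong": {"pinyin": "dong", "wade_giles": "tung", "yale": "dung"},
}


def normalize(romanized: str) -> str:
    """Lowercase the input and keep letters only (drops spaces, hyphens, apostrophes)."""
    return "".join(ch for ch in romanized.lower() if ch.isalpha())


def romanized_keys(syllables: list[str]) -> set[str]:
    """Build every spelling of the syllable sequence, mixing systems per syllable."""
    options = [sorted(set(ROMANIZATIONS[s].values())) for s in syllables]
    return {"".join(combo) for combo in product(*options)}


keys = romanized_keys(["mao", "ze", "dong"])
pinyin_hit = normalize("Mao Zedong") in keys
wade_giles_hit = normalize("Mao Tse-tung") in keys
unrelated_hit = normalize("Mao Zhedong") in keys
```

Even this three-syllable name yields 18 candidate spellings, which suggests why a matcher that assumes a single Romanized form per name will miss legitimate variants.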
In Mandarin Chinese, each character is pronounced as a single syllable. There are only about 1,350 unique syllables in Mandarin when tones are counted, or about 410 when tone is not considered. With tens of thousands of Chinese characters, a single syllable can therefore be represented by dozens of different characters. As a result, names written with entirely different Chinese characters may be transliterated into the same Romanized form. In other words, there is a many-to-one relationship between Hanzi names and their Romanized forms. Thus, it would be beneficial to have a Chinese name matching system capable of matching both Chinese character variants and Romanized variants while significantly reducing the number of false positives that arise from this many-to-one relationship.
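The many-to-one relationship and the false positives it produces can be sketched as follows. The character table `PINYIN`, the sample names, and the functions `romanize`, `match_romanized` and `match_hanzi` are hypothetical and serve only to illustrate the problem; they are not the matching method of the invention.

```python
from collections import defaultdict

# Hypothetical fragment of a character -> toneless Pinyin table.
PINYIN = {
    "李": "li", "黎": "li", "厉": "li",
    "伟": "wei", "威": "wei", "薇": "wei",
    "王": "wang", "芳": "fang",
}


def romanize(name: str) -> str:
    """Toneless Pinyin key for a Hanzi name, one syllable per character."""
    return " ".join(PINYIN[ch] for ch in name)


names = ["李伟", "黎薇", "厉威", "王芳"]

# Group the Hanzi names by Romanized key to expose the many-to-one collapse.
index = defaultdict(list)
for name in names:
    index[romanize(name)].append(name)


def match_romanized(query: str) -> list[str]:
    """A Romanized query can only be compared at the Romanized level."""
    return index.get(query, [])


def match_hanzi(query: str) -> list[str]:
    """Naive Hanzi lookup that goes through the Romanized key of the query."""
    return index.get(romanize(query), [])


collisions = match_romanized("li wei")
naive = match_hanzi("李伟")
exact = [name for name in names if name == "李伟"]
```

Three unrelated names share the key "li wei", so a Hanzi query routed through its Romanized form returns two false positives, whereas comparing at the character level returns only the intended name.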