1. Field of the Invention
The present invention relates generally to translating Chinese pinyin to Chinese characters. More specifically, systems and methods of classifying user input are disclosed.
2. Description of Related Art
Inputting and processing Chinese language text on a computer can be very difficult. This is due in part to the sheer number of Chinese characters, as well as to inherent problems in the Chinese language with text standardization, multiple homonyms, and invisible (or hidden) word boundaries, all of which create ambiguities that complicate Chinese text processing.
One common method available today for inputting Chinese language text into a computer system is phonetic input, e.g., pinyin. Pinyin uses Roman characters and has a vocabulary listed in the form of multiple-syllable words. However, the pinyin input method results in a homonym problem in Chinese language processing. In particular, as there are only approximately 1,300 different phonetic syllables (as can be represented by pinyin) with tones and approximately 410 phonetic syllables without tones representing the tens of thousands of Chinese characters (Hanzi), one phonetic syllable (with or without tone) may correspond to many different Hanzi. For example, the pronunciation of “yi” in Mandarin can correspond to over 100 Hanzi. This creates ambiguities when translating the phonetic syllables into Hanzi.
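By way of illustration only, the homonym problem can be sketched with a small lookup table. The table contents and the names CANDIDATES, candidates, and count_readings are assumptions made for this example and are not taken from any particular input system; a real lexicon would list far more Hanzi per syllable.

```python
# Toy table (assumed for illustration): a few Hanzi per toneless syllable.
CANDIDATES = {
    "yi": ["一", "以", "已", "意", "义", "易", "医", "衣"],
    "bei": ["北", "被", "背", "杯"],
    "jing": ["京", "经", "精", "静"],
}


def candidates(syllable):
    """Return the Hanzi listed for a toneless pinyin syllable."""
    return CANDIDATES.get(syllable.lower(), [])


def count_readings(syllables):
    """Count the distinct Hanzi strings a syllable sequence could denote."""
    total = 1
    for s in syllables:
        total *= len(candidates(s))
    return total


if __name__ == "__main__":
    print(candidates("yi"))
    print(count_readings(["bei", "jing"]))
```

Even with only four candidates per syllable in this toy table, a two-syllable input has sixteen possible Hanzi strings, which suggests how quickly the ambiguity grows with input length.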
Many phonetic input systems use a multiple-choice method to address this homonym problem. Once the user enters a phonetic syllable, a list of possible Hanzi characters with the same pronunciation is displayed. However, the process of inputting and selecting the corresponding Hanzi for each syllable can be slow, tedious, and time-consuming. Other phonetic input systems are based on determining the likelihood of each possible Hanzi character based on the adjacent Hanzi characters. The probability approach can further be combined with grammatical constraints. However, the accuracy of the phonetic-to-Hanzi conversion of such methods is often limited when applied to literature (e.g., with many descriptive sentences and idioms) and/or to spoken or informal language such as that used on the web in user queries and/or bulletin board system (BBS) posts. In addition, low dictionary coverage often contributes to poor conversion quality for spoken language.
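As a minimal sketch of the likelihood-based approach described above, the following example scores each candidate Hanzi against the preceding Hanzi with a bigram model and keeps the best-scoring path (Viterbi decoding). This is one common way such systems may be realized, not a description of any specific product. The probabilities in BIGRAM, the FLOOR value, and the function name decode are hypothetical and chosen only for illustration.

```python
import math

# Assumed toy candidate table (two Hanzi per syllable).
CANDIDATES = {"bei": ["北", "被"], "jing": ["京", "经"]}

# Hypothetical bigram probabilities P(next | prev); "<s>" marks the start.
BIGRAM = {
    ("<s>", "北"): 0.6, ("<s>", "被"): 0.4,
    ("北", "京"): 0.9, ("北", "经"): 0.1,
    ("被", "京"): 0.2, ("被", "经"): 0.8,
}
FLOOR = 1e-6  # assumed probability for character pairs missing from BIGRAM


def decode(syllables):
    """Return the most likely Hanzi string under the toy bigram model."""
    # best maps a Hanzi to (log-probability of best path ending there, path).
    best = {"<s>": (0.0, [])}
    for s in syllables:
        nxt = {}
        for h in CANDIDATES[s]:
            score, path = max(
                (lp + math.log(BIGRAM.get((prev, h), FLOOR)), p)
                for prev, (lp, p) in best.items()
            )
            nxt[h] = (score, path + [h])
        best = nxt
    return "".join(max(best.values())[1])


if __name__ == "__main__":
    print(decode(["bei", "jing"]))
```

Because the model only knows the character pairs present in its table, any pair absent from it falls back to the floor value, which mirrors the coverage limitation noted above for informal or spoken language.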
In addition to the homonym problem, a word boundary problem exists when processing Chinese language text. In particular, although more than 80% of words in modern Chinese have multiple syllables and thus contain two or more Hanzi, there is no word separation in the Chinese writing system. Input of phonetic Chinese is usually performed syllable by syllable without accounting for word boundaries. Moreover, there is no consistency among users in inputting phonetic Chinese (pinyin) word boundaries. For example, some people consider “Beijing daxue” (phonetic representation meaning Beijing University) as two words while others may regard the pinyin as one word and input the pinyin without any boundaries, i.e., “Beijingdaxue.”
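The effect of inconsistent boundaries can be illustrated with a short sketch that discards user-entered spaces and enumerates every way of splitting the remaining letters into known syllables. The syllable set SYLLABLES and the function name segmentations are assumptions for this example; a complete system would use the full pinyin syllable inventory.

```python
# Assumed toy syllable inventory; real pinyin has roughly 410 toneless syllables.
SYLLABLES = {"bei", "jing", "da", "xue", "xi", "an", "xian"}


def segmentations(text):
    """Return all ways to split a pinyin string into known syllables."""
    text = text.replace(" ", "").lower()  # ignore user-entered boundaries
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in SYLLABLES:
            for rest in segmentations(text[end:]):
                results.append([head] + rest)
    return results


if __name__ == "__main__":
    print(segmentations("Beijing daxue"))
    print(segmentations("Beijingdaxue"))
    print(segmentations("xian"))
```

In this sketch both spellings of the university name reduce to the same syllable sequence, whereas an input such as "xian" admits two segmentations ("xi" + "an" or "xian"), showing that the boundaries cannot always be recovered from the letters alone.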
The homonym problem and the lack of word boundaries are two of the main contributing factors that make it difficult to provide an easy, effective and accurate mechanism for Chinese language text input and processing. A given Chinese text input in pinyin may create many ambiguities that the conventional methods cannot properly resolve.
Thus, what is needed is a computer system for effectively, efficiently, and accurately processing and translating phonetic Chinese text, e.g., pinyin, to Chinese characters and/or words.