Unlike English, where a word is composed of one or more of the 26 letters of the alphabet, a word in Chinese may be composed of one or more Chinese characters. There are debates on what constitutes a Chinese word. The complexity of the word definition in Chinese is further compounded by the lack of grammatical elements to segment a sentence into multiple words. In Chinese, there are no ‘space characters’ to delimit a word. A reader sometimes must read the entire sentence to understand its meaning and to determine which words compose it.
In the absence of a clear definition of a word and in order to avoid confusion, we have adopted the term ‘phrase’ to refer to any written series of two or more Chinese characters that is no more than a sentence in length. The present invention is not concerned with single Chinese characters because it is impossible to decide if they are misspelled.
The Chinese spelling problem refers to misspelled characters in a Chinese phrase. In Roman languages, spelling refers to the writing of words with alphabetic letters, while misspelling refers to mistakes in the choice or placement of letters. In non-Roman, character-based languages such as Chinese, there can be mismatched characters, i.e. one or more of the characters making up a word/phrase can be incorrect. Technically, however, this is not a misspelling, because Chinese words/phrases are made up of characters, not alphabetic letters. In the interest of communicating the fact that there is an error in a Chinese phrase, we will use misspell and mismatch interchangeably throughout the document.
Besides meaning, every Chinese character is associated with the following attributes:
(A) Radicals: A Chinese character is typically composed of radicals. For instance, the Chinese character ‘’ (good) is composed of two radicals: ‘’ (girl), the root radical, and ‘’ (child), sometimes called the right radical or a non-root radical. The position (top, bottom, left or right) of the root radical and the non-root radical in a character is not fixed, although most root radicals are on the left side of a character. There are 214 root radicals defined in the Chinese language. In the case of the character ‘’, both ‘’ (girl) and ‘’ (child) are root-radical characters. But since only one radical can be the root radical, the left-side radical ‘’ (girl) is defined as the root radical and the right-side ‘’ is defined as the non-root radical. Because different characters can share a radical, it is possible to mistakenly write ‘’ (bubble) as ‘’ (cannon) in a phrase such as ‘’ (follow the same method), since both have the same non-root radical character ‘’. There are two variants of the Chinese writing system: simplified Chinese is used in China and traditional Chinese is used in Taiwan. Most of the characters used in both variants are identical, but some characters that have the same meaning are written differently and, hence, have different radicals. For instance, the simplified Chinese character ‘’ corresponds to two different traditional Chinese characters ‘’ (and) and ‘’ (combine). The meaning of the character ‘’ in simplified Chinese must then be determined from the context of a phrase.
(B) Pronunciation: Various phonetic systems have been invented to record and teach the pronunciation of Chinese characters. In simplified Chinese, the phonetic system used is called ‘pinyin’; in traditional Chinese, the phonetic system used is called ‘BoPoMoFo’ or BPMF. Different characters may be pronounced identically. For instance, in Japanese Kanji, both ‘’ (probability) and ‘’ (to formalize) are pronounced as kakuritsu, but the second characters are different. This is one possible cause of misspelling. Another possible cause of misspelling or misusing a character in a Chinese phrase is the similarity in phonetics between two characters. For example, in Chinese pinyin, ‘fa’ and ‘hua’ sound similar (or relatively similar), so some people may mistakenly write ‘’ (fā fēi, to display) as ‘’ (huā fēi), which is meaningless. In addition, some Chinese characters may have more than one pronunciation, depending on where the character is used in a phrase. For instance, in Chinese, ‘’ is pronounced as ‘chī’ in ‘’ (eat a meal) and as ‘jī’ in ‘’ (stutter). So, a user may misspell ‘’ as ‘’, which has the same pronunciation as stutter (‘’) but is meaningless as a phrase.
So, it is conceivable that, in writing a Chinese phrase, a user may misuse a character due to misunderstanding radical or pronunciation attributes. A user may mistakenly write a Chinese character in place of another Chinese character that looks similar except that the radicals are not exactly the same. It is also possible that a user may write a Chinese character having a completely different meaning in a phrase. So, to solve the misuse problem in a Chinese phrase, it is insufficient to rely on a single detection method alone (e.g., pinyin or radical). It is necessary to examine all the possible causes of a mistake to correct the misused character in a phrase.
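The idea of examining more than one possible cause can be illustrated with a minimal sketch. It assumes two small confusion tables, one for characters that look alike and one for characters that sound alike, plus a tiny phrase dictionary. The tables, the dictionary contents, and the names `SIMILAR_SHAPE`, `SIMILAR_SOUND`, `PHRASES`, `candidates` and `suggest` are all hypothetical and chosen only for illustration; they are not the method of the present invention.

```python
# Toy data, assumed for illustration only.
PHRASES = {"如法炮製", "發揮"}                      # known correct phrases
SIMILAR_SHAPE = {"泡": {"炮"}, "炮": {"泡"}}        # look-alike characters
SIMILAR_SOUND = {"花": {"發"}, "發": {"花"}}        # sound-alike characters


def candidates(ch):
    """Return replacement characters from both confusion tables."""
    return SIMILAR_SHAPE.get(ch, set()) | SIMILAR_SOUND.get(ch, set())


def suggest(phrase):
    """Suggest known phrases reachable by replacing one character."""
    if phrase in PHRASES:
        return []  # already a known phrase, nothing to correct
    found = set()
    for i, ch in enumerate(phrase):
        for alt in candidates(ch):
            trial = phrase[:i] + alt + phrase[i + 1:]
            if trial in PHRASES:
                found.add(trial)
    return sorted(found)


print(suggest("如法泡製"))  # shape-based confusion
print(suggest("花揮"))      # sound-based confusion
```

Relying on only one of the two tables would miss one of the two errors above, which is the point made in the preceding paragraph.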
In general, spell checking for a language consists of two major functions: first, the identification of the incorrect letter/character in a word/phrase and second, the correction of the incorrect letters/characters, if possible. When correcting spelling problems, a Roman-based language such as English corrects the alphabetic letters associated with a word, whereas a non-Roman language such as Chinese corrects the characters associated with a phrase.
When comparing phrases to identify a mismatched character, the edit distance is commonly used in computer science. In Roman languages such as English, the edit distance is defined as the number of letters that differ between two words. The comparison is performed at the letter level. If the edit distance is 0, the two words are identical. If the edit distance is 1, the two words differ by one letter, and if the edit distance is 2, they differ by two letters. If the edit distance is greater than 2, the two words are most likely different words, and it is either impossible or not worthwhile to correct the spelling error. Once the edit distance is computed, one can then attempt to correct the incorrect letters in a word by comparing them with the letters in the same positions of a correct word.
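The letter-level comparison described above can be sketched as follows. This is a minimal illustration of the position-by-position definition used in this section, not the more general Levenshtein edit distance, which also counts insertions and deletions. The function names `letter_distance` and `suggest`, the handling of unequal word lengths, and the sample dictionary are assumptions made for the example.

```python
def letter_distance(a, b):
    """Count the positions at which two words have different letters.

    Assumption: any difference in length is added to the count, so that
    words of unequal length are never reported as identical.
    """
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing + abs(len(a) - len(b))


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance of word, closest first."""
    scored = sorted((letter_distance(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance]


print(letter_distance("spell", "spell"))
print(letter_distance("spell", "spall"))
print(suggest("spall", ["spell", "shall", "sport"]))
```

The threshold of 2 mirrors the cutoff stated above: candidates farther away than that are treated as different words rather than as corrections.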