1. Field of the Invention
The present invention relates generally to processing non-Roman based languages. More specifically, fault-tolerant systems and methods to process and correct input spelling errors for non-Roman based languages such as Chinese, Japanese, and Korean (CJK) are disclosed.
2. Description of Related Art
Spell correction generally includes detecting erroneous words and determining appropriate replacements for the erroneous words. Most spelling errors in alphabetical, i.e., Roman-based, languages such as English are either out-of-vocabulary words, e.g., “thna” rather than “than,” or valid words improperly used in their context, e.g., “stranger then” rather than “stranger than.” Spell checkers that detect and correct out-of-vocabulary spelling errors in Roman-based languages are well known.
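By way of illustration only, the following Python sketch shows one common way an out-of-vocabulary error such as “thna” may be detected and corrected: a wordlist lookup followed by a search for wordlist entries one edit away. The wordlist contents and the names `WORDLIST`, `edits1`, and `check` are hypothetical and chosen for this example; the sketch is not a description of any particular spell checker.

```python
import string

# Hypothetical toy wordlist; a real checker would load a full dictionary.
WORDLIST = {"stranger", "than", "then", "thin"}


def edits1(word):
    """Return every string one edit (delete, transpose, replace, insert) from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def check(word):
    """Return (is_valid, suggestions) using wordlist lookup only."""
    if word in WORDLIST:
        return True, []
    # Out-of-vocabulary: suggest wordlist entries that are one edit away.
    return False, sorted(edits1(word) & WORDLIST)


print(check("thna"))  # out-of-vocabulary, so suggestions are searched
print(check("then"))  # a valid word, so it is accepted without regard to context
```

Because “then” is itself a valid wordlist entry, a lookup of this kind accepts “stranger then” unchanged; detecting that second class of error requires contextual information.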
Users of non-Roman based languages such as Chinese, Japanese, and Korean (CJK) often utilize Roman-based (alphabetical) input methods. For example, many Chinese language users use pinyin (phonetic spelling) to input Chinese characters. However, Chinese language users may not know the correct pronunciations (pinyins) of some Chinese characters due to, for example, their dialect and/or accent, and therefore may enter incorrect pinyin inputs.
A conventional pinyin input system typically converts a pinyin input into a list of candidate Chinese character sets from which the user may select the intended set of Chinese characters. However, the user's intended character set may not be included in the candidate list if the pinyin input is incorrect, as most pinyin input methods have low or no fault tolerance.
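As a minimal sketch of this limitation, the following Python example models a pinyin input method as an exact-match table lookup. The table contents and the names `PINYIN_TABLE` and `candidates` are hypothetical, and the mistyped input “beijin” (a dropped final “g”) is an assumed example of a dialect-driven error; actual input methods are considerably more elaborate.

```python
# Hypothetical toy table mapping toneless pinyin to candidate character sets.
PINYIN_TABLE = {
    "beijing": ["北京", "背景"],
    "shanghai": ["上海", "伤害"],
}


def candidates(pinyin):
    """Return candidate character sets for an exactly matching pinyin input."""
    # Exact-match lookup: any deviation in the input yields no candidates.
    return PINYIN_TABLE.get(pinyin.lower(), [])


print(candidates("beijing"))  # correct pinyin yields a candidate list
print(candidates("beijin"))   # incorrect pinyin yields an empty list
```

With no fault tolerance, the incorrect input produces an empty candidate list, so the intended character set cannot be selected.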
In addition, non-Roman based languages such as Chinese, Japanese, and Korean (CJK) generally have no invalid characters encoded in any computer character set, e.g., the UTF-8 character set, such that most spelling errors are valid characters improperly used in context rather than out-of-vocabulary spelling errors. In Chinese, the correct use of words can generally only be determined in context. Thus, an effective spell checker for a non-Roman based language should make use of contextual information to determine which characters and/or words are not suitable in context.
Spell correction for non-Roman based languages such as CJK languages is also complex and challenging in that there are no standard dictionaries in such languages because the definitions of CJK words are not clear-cut. For example, some may regard “Beijing city” in Chinese as one word while others may regard it as two words. In contrast, dictionary/wordlist lookup is a key feature of English spell correction, and thus English spell correction methods cannot be easily adapted for use in CJK languages. Furthermore, the Chinese language has a high concentration of homographs and homophones as well as invisible (or hidden) word boundaries, which create ambiguities that make efficient and effective Chinese spell correction complex and difficult to implement. As is evident from such differences between Chinese and English, many efficient techniques available for English spell correction are not suitable for Chinese spell correction.
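The word boundary ambiguity may be illustrated with the following Python sketch, which enumerates every way a character string can be split into entries of a lexicon. It assumes that “Beijing city” is written 北京市 and uses a hypothetical three-entry lexicon; the function name `segmentations` is likewise chosen for this example only.

```python
def segmentations(text, lexicon):
    """Return every way to split text into consecutive lexicon entries."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this prefix as a word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical lexicon that lists both the two-word and the one-word reading.
LEXICON = {"北京", "市", "北京市"}

print(segmentations("北京市", LEXICON))
```

Both the two-word and the one-word readings are returned, and nothing in the character string itself indicates which is intended, since no boundary is written between words.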
Thus what is needed is a computer system and method for effective, efficient and accurate processing and correcting of spelling errors for non-Roman based languages such as Chinese, Japanese and Korean languages.