The present invention relates to text processing. In particular, the present invention relates to identifying erroneous characters in text.
In many languages, a large set of distinctive characters is used to represent individual words or small parts of words. Examples of such languages are Chinese, Japanese, Korean, and Arabic. Instead of relying on a small alphabet of symbols to build individual words, these languages rely on thousands of distinctive characters. For example, written Chinese uses more than 5,000 distinctive characters.
One problem with such languages is that many of the characters have a similar shape, making it easy for keyboard operators to select the wrong character when entering text using a keystroke method. Errors can also occur when characters are entered phonetically, since many characters have similar pronunciations.
Before performing certain operations on a text, such as checking grammar, synthesizing speech from text, or performing natural language parsing, it is helpful to identify any erroneous characters in the text and to determine the characters that were intended. Under the prior art, erroneous characters have been detected using simple bigram models that determine the probability of any two characters appearing next to each other in a text. These statistical models are less than ideal because of the scarcity of large sets of text from which to build the models. Most such systems are able to detect an erroneous character only 54% of the time and are correct in identifying erroneous characters only 61% of the time. In addition, they are often unable to suggest the correct characters. Thus, a better technique for identifying erroneous characters in languages such as Chinese, Japanese, Korean, and Arabic would be beneficial.
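By way of illustration only, the prior-art bigram approach may be sketched as follows. The sketch assumes a simple relative-frequency estimate and a hypothetical probability threshold; the function names, the threshold value, and the two-sentence training corpus are assumptions made for this example and are not taken from any particular prior-art system.

```python
from collections import Counter
from typing import List, Tuple

def train_bigrams(corpus: List[str]) -> Tuple[Counter, Counter]:
    """Count single characters and adjacent character pairs in the corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)                     # one count per character
        bigrams.update(zip(sentence, sentence[1:]))   # one count per adjacent pair
    return unigrams, bigrams

def bigram_prob(prev: str, cur: str, unigrams: Counter, bigrams: Counter) -> float:
    """Relative-frequency estimate of P(cur | prev); 0.0 if prev was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, cur)] / unigrams[prev]

def low_probability_bigrams(text: str, unigrams: Counter, bigrams: Counter,
                            threshold: float = 0.01) -> List[Tuple[int, str]]:
    """Return (index, pair) for each adjacent pair whose probability is below threshold."""
    flagged = []
    for i in range(len(text) - 1):
        if bigram_prob(text[i], text[i + 1], unigrams, bigrams) < threshold:
            flagged.append((i, text[i:i + 2]))
    return flagged

# Hypothetical toy corpus; a real model would need a very large text collection.
unigrams, bigrams = train_bigrams(["我自己去", "他自己来"])
suspects = low_probability_bigrams("我自已去", unigrams, bigrams)
```

Such a model flags unlikely character pairs but does not, by itself, indicate which character of a pair is wrong or what the intended character was. With little training text, legitimate pairs that were never observed are flagged as well.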
A method and apparatus are provided that identify confused characters in a text written in a language having a large number of distinct characters. To identify the confused characters, a set of characters from the text is segmented into individual characters. A confusable character for at least one of the segmented characters is then retrieved. Lexical information is identified for both the segmented characters and the retrieved confusable characters and is used to parse the segmented characters and the confusable characters. Based on the parse, a segmented character is identified that has been confused with a confusable character.
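The sequence of steps summarized above may be sketched, under simplifying assumptions, as follows. In this sketch the parse is replaced by a toy stand-in that merely counts how many characters are covered by multi-character lexicon words; an actual embodiment would use lexical information and a natural language parser. All function names, the lexicon, and the confusable entries are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def segment(text: str) -> List[str]:
    """Segment the text into individual characters."""
    return list(text)

def coverage_score(chars: List[str], lexicon: Set[str]) -> int:
    """Toy stand-in for a parse: greedy longest match, counting the
    characters covered by multi-character lexicon words."""
    covered, i = 0, 0
    while i < len(chars):
        for length in range(len(chars) - i, 1, -1):
            if "".join(chars[i:i + length]) in lexicon:
                covered += length
                i += length
                break
        else:
            i += 1  # no multi-character word starts at this position
    return covered

def find_confused(text: str, confusables: Dict[str, List[str]],
                  lexicon: Set[str]) -> List[Tuple[int, str, str]]:
    """Return (position, segmented character, confusable character) wherever
    substituting the confusable character improves the stand-in score."""
    chars = segment(text)
    baseline = coverage_score(chars, lexicon)
    findings = []
    for i, original in enumerate(chars):
        for alt in confusables.get(original, []):
            trial = chars[:i] + [alt] + chars[i + 1:]
            if coverage_score(trial, lexicon) > baseline:
                findings.append((i, original, alt))
    return findings

# Hypothetical data: 已 is graphically similar to 己, and 自己 is a lexicon word.
result = find_confused("我自已去", {"已": ["己"]}, {"自己"})
```

In this example the third character is reported as having been confused with 己, because the substituted sequence contains the lexicon word 自己 while the original sequence contains no multi-character word.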
In many embodiments of the invention, the confusable characters are retrieved from a confusable character list that associates each segmented character with characters that may be confused with it. Under some embodiments, the confusable character list contains characters that are graphically similar to their respective segmented characters. In other embodiments, the confusable character list contains characters that are phonetically similar to their respective segmented characters. In still other embodiments, the invention selects between a graphically similar list and a phonetically similar list based on the method that was used to place the characters into computer-readable form.
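One possible arrangement of the confusable character lists, and of the selection between them, is sketched below. The list contents, the input-method labels, and the function names are assumptions chosen for illustration; an actual list would be far larger.

```python
from typing import Dict, List

# Hypothetical list of graphically similar characters (illustrative entries only).
GRAPHIC_CONFUSABLES: Dict[str, List[str]] = {
    "己": ["已", "巳"],
    "未": ["末"],
}

# Hypothetical list of phonetically similar characters (illustrative entries only).
PHONETIC_CONFUSABLES: Dict[str, List[str]] = {
    "在": ["再"],
    "他": ["她", "它"],
}

def select_confusable_list(input_method: str) -> Dict[str, List[str]]:
    """Choose a list based on how the text was placed into computer-readable form."""
    if input_method == "phonetic":
        return PHONETIC_CONFUSABLES
    # Assumption: any other entry method is treated as shape-based.
    return GRAPHIC_CONFUSABLES

def confusables_for(char: str, input_method: str) -> List[str]:
    """Retrieve the confusable characters for one segmented character."""
    return select_confusable_list(input_method).get(char, [])
```

For example, `confusables_for("在", "phonetic")` retrieves 再, whereas the same character has no entry in the graphically similar list of this sketch.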
In some embodiments of the invention, multi-character words are constructed from the segmented characters and the permutations formed by selectively replacing segmented characters with confusable characters.
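The construction of multi-character words from such permutations may be sketched as follows. The sketch assumes a hypothetical lexicon held as a set of strings and a hypothetical maximum word length; it enumerates every permutation exhaustively, which is practical only for short character sets.

```python
from itertools import product
from typing import Dict, List, Set

def candidate_sequences(chars: List[str],
                        confusables: Dict[str, List[str]]) -> List[List[str]]:
    """Form every permutation obtained by keeping each segmented character
    or replacing it with one of its confusable characters."""
    options = [[c] + confusables.get(c, []) for c in chars]
    return [list(combo) for combo in product(*options)]

def multi_char_words(chars: List[str], confusables: Dict[str, List[str]],
                     lexicon: Set[str], max_len: int = 4) -> Set[str]:
    """Collect lexicon words of two or more characters found in any permutation."""
    found = set()
    for seq in candidate_sequences(chars, confusables):
        for start in range(len(seq)):
            last = min(start + max_len, len(seq))
            for end in range(start + 2, last + 1):
                word = "".join(seq[start:end])
                if word in lexicon:
                    found.add(word)
    return found

# Hypothetical data: replacing 已 with 己 yields the lexicon word 自己.
words = multi_char_words(list("自已"), {"已": ["己"]}, {"自己", "已经"})
```

Here the original characters form no lexicon word, while the permutation in which 已 is replaced by 己 yields 自己.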