The present invention relates to word processing systems, and more particularly relates to detecting typographical errors and generating replacement strings in documents that contain Japanese text.
Typographical (spelling) checkers, style checkers, and grammar checkers are common in modern word processing programs. The Japanese language presents interesting problems in this area because of several characteristics of the written language. First, the Japanese language employs several different alphabets, which may be used in combination. Second, Japanese text is typically written without any spaces between words. Third, the Japanese language has a highly productive morphology, which means that Japanese words can undergo significant spelling changes to indicate case, tense, politeness, aspect, mood, voice, and the like.
The most commonly used Japanese alphabets (or writing systems) are Kanji, Hiragana, and Katakana. The Kanji alphabet includes pictographs or ideographic characters that were adopted from the Chinese alphabet. Hiragana and Katakana are phonetic alphabets that do not include any characters common to each other or to Kanji. Hiragana is used to spell words of Japanese origin. Katakana is used to spell words of foreign (primarily western) origin. Kanji pictographs are analogous to shorthand variants of Hiragana words in that any Kanji word can be written in Hiragana, though the converse is not true. A single Japanese word can include characters from more than one alphabet.
One of the functions performed by typographical checkers is to detect malformed phrases, or words, and suggest replacement text strings. The types of malformed words detected by typographical checkers include (using the example of “hello”): 1) transposed characters (e.g., “helol”); 2) missing characters (e.g., “hllo”); 3) duplicate characters (e.g., “heello”); 4) extra characters (e.g., “hepllo”); and 5) a wrong character (e.g., “hwllo”). One approach to performing typographical checking for the Japanese language is to use a dictionary look-up. This approach looks up every word or stem in the document and compares it against a Japanese dictionary to determine if it is valid. However, over-flagging of some words and under-flagging of typographical errors can occur due to the large number of characters in the Japanese language and the non-delimited nature of Japanese text.
Another approach to typographical checking uses a heuristic pattern-match, in which rules are used to identify frequent typographical mistakes. This approach, however, often under-flags typographical errors because such errors cannot be easily classified into groups when written in the Japanese language.
Yet another approach to typographical checking uses a statistical likelihood of occurrence. This approach uses a large trained corpus of text to compute a probability of whether any given string of characters is well-formed. This approach suffers from requiring a significant investment in training corpora which often contain typographical errors themselves. In addition, because there are an infinite number of sentences in the Japanese language, it is very difficult to robustly model well-formed strings using this approach.
Therefore, there is a need in the art for an improved method for identifying typographical errors in Japanese text and generating replacement strings for malformed text. An acceptable Japanese language solution should be small enough (in terms of memory requirements) and fast enough to perform satisfactorily in a desktop computer environment.
The present invention satisfies the above-described needs by providing an improved method for detecting typographical errors and generating replacement strings in documents containing Japanese text. The present invention employs a bottom-up approach utilizing a dictionary, heuristics and probability analysis to determine whether a typographical error exists and then utilizes heuristics, finite-state morphology and a dictionary to generate a replacement string.
Generally described, the present invention parses a Japanese sentence using morpho-lexical analysis. The result of the morpho-lexical analysis is a list of valid phrases that are contained in the Japanese sentence and a cost associated with each phrase. The phrase corresponds to the standard phonological unit, called bunsetsu, taught in Japanese schools. The present invention operationally defines a phrase as one or more dictionary words (in their stem or non-conjugated form) prefixed and/or postfixed with zero or more morphemes. Since the phrase is constructed from morphemes and dictionary words (lexical entries), the analysis is described as morpho-lexical in nature. The cost associated with each phrase is derived from the probability that each word and morpheme making it up, and the combination thereof, constitute the intended analysis of the corresponding set of characters in the input sentence. The present invention receives the valid phrases and their associated costs from the morpho-lexical analysis. The valid phrases are then combined in such a way as to create all possible non-overlapping sets of phrases in an effort to find one such set that represents the entire string of characters in the input sentence. For simplicity, these sets of non-overlapping phrases are referred to as phrase lists. When the phrases are combined, their respective costs are also combined, resulting in a summed associated cost for the phrase list. If any phrase list spans the input sentence, i.e., the phrase list exactly duplicates the input sentence, no typographical error exists and processing ceases.
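By way of illustration only, the search for a spanning phrase list may be sketched as follows. The sketch assumes that each valid phrase is reported as a character range with a cost and that lower costs indicate more likely analyses. The names Phrase and cheapest_spanning_list are hypothetical, and a dynamic-programming pass over character positions is substituted here for an explicit enumeration of every non-overlapping set; it is not asserted to be the disclosed implementation.

```python
from typing import List, NamedTuple, Optional, Tuple


class Phrase(NamedTuple):
    start: int   # index of the first character covered
    end: int     # index one past the last character covered
    cost: float  # assumed convention: lower cost means a more likely analysis


def cheapest_spanning_list(
    length: int, phrases: List[Phrase]
) -> Optional[Tuple[float, List[Phrase]]]:
    """Return (summed cost, phrase list) for the cheapest phrase list that
    covers characters [0, length) exactly, or None if no list spans them."""
    # best[i] is the cheapest (cost, phrase list) covering characters [0, i).
    best: List[Optional[Tuple[float, List[Phrase]]]] = [None] * (length + 1)
    best[0] = (0.0, [])
    for i in range(1, length + 1):
        for p in phrases:
            # Only well-formed phrases that end exactly at position i apply.
            if p.end != i or not 0 <= p.start < p.end:
                continue
            prefix = best[p.start]
            if prefix is None:
                continue  # nothing covers the characters before this phrase
            total = prefix[0] + p.cost  # costs are summed when phrases combine
            if best[i] is None or total < best[i][0]:
                best[i] = (total, prefix[1] + [p])
    return best[length]
```

A return value of None corresponds to the case in which no phrase list spans the input sentence, so that a typographical error may be present.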
If no spanning phrase list exists, then the phrase list having the lowest combined associated cost, i.e., the phrase list whose combined associated cost signifies that it is most representative of the input sentence, is selected. Using the selected phrase list, any “holes” are determined. A hole is a character, or set of characters, that is found in the input sentence but not in the selected phrase list. In other words, a hole is a character or set of characters that the selected phrase list does not cover, corresponding to a gap in the analysis. The hole is where a typographical error, if any, exists within the input sentence. For one aspect of the present invention, the hole is checked to determine if any part of it can be analyzed, using the morpho-lexical process, as a valid phrase when an extended dictionary is enabled. In addition, rules are applied to determine if any part of the hole can be analyzed as a proper noun. The hole may be “relaxed” by adding contiguous characters next to the hole from the input sentence and rechecking the “relaxed” hole in the same way as above, i.e., by enabling an extended dictionary and performing a secondary morpho-lexical analysis and by applying a set of proper noun rules.
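As a minimal sketch of the hole determination described above, assume that the selected phrase list is available as a set of non-overlapping character ranges. The function names find_holes and relax, and the parameter by that fixes how many neighboring characters are added on each side, are hypothetical; the amount of relaxation is an assumption made for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) with end exclusive


def find_holes(length: int, covered: List[Span]) -> List[Span]:
    """Return the character ranges of the sentence that no phrase covers."""
    holes: List[Span] = []
    pos = 0  # first character not yet accounted for
    for start, end in sorted(covered):
        if start > pos:
            holes.append((pos, start))  # gap before this phrase
        pos = max(pos, end)
    if pos < length:
        holes.append((pos, length))  # gap after the last phrase
    return holes


def relax(hole: Span, length: int, by: int = 1) -> Span:
    """Widen a hole by `by` characters on each side, clamped to the sentence."""
    start, end = hole
    return (max(0, start - by), min(length, end + by))
```

For a ten-character sentence whose selected phrase list covers ranges (0, 3) and (5, 8), find_holes returns [(3, 5), (8, 10)], and relaxing the first hole by one character gives (2, 6). Each hole, or relaxed hole, would then be rechecked against the extended dictionary and the proper noun rules.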
A replacement string is then generated for the hole. The replacement string is generated using heuristics (rules) intended to counteract the process by which the error was created. The rules match patterns associated with certain types of errors and make appropriate changes to correct those errors, associating a cost with each correction. The replacement candidates thus generated then undergo morpho-lexical analysis and are ranked according to the combination of their associated costs. All candidates which score better than a certain threshold value, i.e., have a lower cost, are presented to the user as potential replacements.
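The generation and ranking of replacement candidates may be sketched as follows, using the Latin-alphabet “hello” examples for readability. Everything named here is an assumption made for illustration: the two rules shown (undoing a transposition and deleting one character), the cost constants, and the lexicon mapping, which stands in for the morpho-lexical analysis by assigning a cost to each string it accepts as valid.

```python
from typing import Dict, List

# Hypothetical rule costs; lower means the correction is considered more likely.
COST_TRANSPOSE = 1.0
COST_DUPLICATE = 0.5
COST_EXTRA = 1.5


def candidates(hole: str) -> Dict[str, float]:
    """Map each candidate string to the cheapest rule cost that produces it."""
    out: Dict[str, float] = {}

    def offer(text: str, cost: float) -> None:
        if text != hole and cost < out.get(text, float("inf")):
            out[text] = cost

    for i in range(len(hole) - 1):
        # Counteract a transposition by swapping adjacent characters back.
        offer(hole[:i] + hole[i + 1] + hole[i] + hole[i + 2:], COST_TRANSPOSE)
    for i in range(len(hole)):
        # Counteract an extra or duplicated character by deleting it.
        duplicated = i > 0 and hole[i] == hole[i - 1]
        offer(hole[:i] + hole[i + 1:], COST_DUPLICATE if duplicated else COST_EXTRA)
    return out


def suggest(hole: str, lexicon: Dict[str, float], threshold: float) -> List[str]:
    """Return valid candidates whose combined cost is below the threshold,
    ordered from lowest to highest combined cost."""
    scored = [
        (rule_cost + lexicon[text], text)
        for text, rule_cost in candidates(hole).items()
        if text in lexicon  # stand-in for a successful morpho-lexical analysis
    ]
    return [text for total, text in sorted(scored) if total < threshold]
```

With a lexicon of {"hello": 1.0, "help": 2.0} and a threshold of 5.0, suggest("helol", ...) returns ["hello"]. Rules for missing and wrong characters would follow the same pattern, inserting or substituting characters and associating a cost with each correction.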
One advantage of the bottom-up approach applied by the present invention for identifying typographical errors is that it greatly reduces the number of searching tasks, and consequently the processing, required to find a typographical error. This reduced processing increases the performance and efficiency of typographical error checking within a desktop computing environment. By increasing performance, more analysis can be done in the same time it takes slower systems, thus resulting in an overall increase in precision. Another advantage of the present invention is that it increases the reliability of the typographical error checking. By reconstructing the input sentence through the use of well-formed words, the bottom-up approach decreases any over-flagging errors that may be present in the morpho-lexical analysis. Another advantage of the present invention is that it reduces the over-analysis of sentences by rare words. By using a two-pass algorithm, in which a primary dictionary of higher frequency words is used for a first morpho-lexical analysis, followed by a morpho-lexical analysis of holes using an extended dictionary, this approach reduces the errors caused by low frequency or rare words existing in the extended dictionary, yet still allows them to be incorporated into the error-detection process.
These and other aspects of the present invention may be more clearly understood and appreciated from a review of the following detailed description of the disclosed embodiments and by reference to the appended drawings and claims.