1. Field of the Invention
The present invention relates to morphological analysis and, more specifically, to a character string correction system and method for analyzing an input character string (including symbols, etc.) and outputting a corresponding character string as a recognition result.
2. Description of the Related Art
Morphological analysis divides an input sentence into words. It is the most basic step in text processing (or natural language processing) and has therefore been the subject of much research. Most of the prior art, however, focuses on dictionary retrieval, in the sense that the processor simply receives the words as the writer has delimited them (using spaces and punctuation marks) and merely looks them up in the dictionary (sometimes correcting spelling errors by some approximation method).
However, spaces and punctuation marks can be mispositioned or forgotten just as other characters can. Also, some languages (e.g. Chinese, Japanese) don't use spaces to delimit words and moreover some languages (e.g. German, Dutch) allow much freedom in concatenating dictionary words to make new words.
Because of these phenomena, one cannot rely completely on spaces and punctuation marks to indicate word boundaries.
An alternative method is to read the input character by character and to match it with the words in the dictionary character by character. Using this method, the input characters can be processed without a preconceived notion of where a word ends, and a word will be judged to have ended if the characters read in up to a certain point correspond to a dictionary word and the rest of the input fails to match with the dictionary.
Such a mechanism can be implemented in several ways, conceptually the most simple one being to keep the entire dictionary in memory and to successively discard words that don't match the input word.
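The conceptually simplest approach described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `longest_word_by_elimination` and the sample dictionary are hypothetical and do not appear in the prior art described here. The candidate set starts as the whole dictionary, and words that fail to match the input at the current position are discarded one character at a time.

```python
def longest_word_by_elimination(text, dictionary):
    """Return the longest dictionary word that is a prefix of text, or None."""
    candidates = set(dictionary)  # the entire dictionary is kept in memory
    matched = None
    for i, ch in enumerate(text):
        # Discard every word that does not have ch at position i.
        candidates = {w for w in candidates if len(w) > i and w[i] == ch}
        if not candidates:
            break  # the rest of the input fails to match the dictionary
        prefix = text[: i + 1]
        if prefix in candidates:
            matched = prefix  # the characters read so far form a word
    return matched


# Hypothetical sample dictionary, for illustration only.
words = {"a", "air", "airport", "airborne"}
print(longest_word_by_elimination("airport terminal", words))
```

Filtering the whole candidate set at every character is what makes this approach costly for a large dictionary.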
The most common method, however, involves reorganizing the dictionary words into a number of tables (called TRIE tables) that reference each other. Thus there would be one table (TRIE table `root`) that holds the initial characters of all the words in the dictionary. For example, the table entry containing the character `a` would point to another table which holds all the characters in second positions of words starting with an `a`. This method is commonly known as the TRIE method. (Donald E. Knuth. The Art of Computer Programming. Volume 3: Sorting and Searching. Addison-Wesley Series in Computer Science and Information Processing. Addison-Wesley Company, Reading (Mass.), 1973.)
FIGS. 1A through 1D show a simple example of some TRIE tables and their connections. The first column in each of the TRIE tables, titled "input character", gives the characters that might occur at the particular point in the word that each of these TRIE tables represents. Note that although the tables in FIGS. 1A through 1D contain only alphabetical characters, this is not necessarily so: numbers, punctuation marks and even spaces can also be used in this column.
The TRIE table "root" shown in FIG. 1A is the high order TRIE table storing the first input character of all words, that is, storing all alphabetical characters.
The second column, titled "dictionary word", indicates whether the string of characters read up to this point corresponds to a dictionary entry. In the example shown in FIGS. 1A through 1D, this is done by giving the part of speech of the entry if it does, and the empty-set symbol ".phi." if it does not. For example, "Art", "N", "Prop. N", and "Prep" respectively indicate an article, noun, proper noun, and preposition.
For example, the input character "a" in the TRIE table "r-" shown in FIG. 1B corresponds to two dictionary words, that is, the chemical symbol "ra" (N) indicating radium and "ra" (Prop. N) indicating the name of an Egyptian god.
The third column, titled "TRIE table link", gives the name of the TRIE table corresponding to the position in the input word reached by processing the character in the first column, and thus indicates the connection to the subsequent TRIE table. For example, the TRIE table link "r-" of the TRIE table "root" refers to the TRIE table "r-". The TRIE table link "rd-" of the TRIE table "r-" refers to the TRIE table "rd-" shown in FIG. 1C. The TRIE table link "rd.-" of the TRIE table "rd-" refers to the TRIE table "rd.-" shown in FIG. 1D. A TRIE table that contains no entry, for example the TRIE table "rd.-" shown in FIG. 1D, indicates that there are no corresponding dictionary entries having subsequent characters.
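The three-column layout described above can be illustrated with a small sketch. It is offered only as one possible representation, not as the layout actually used in FIGS. 1A through 1D: the function name `build_trie_tables`, the row keys `"word"` and `"link"`, and the part-of-speech tag given to "rd." are assumptions made for illustration. An empty list in the "word" column plays the role of the empty-set symbol.

```python
def build_trie_tables(lexicon):
    """Build named TRIE tables from a mapping of word -> part-of-speech tags.

    Returns {table_name: {input_character: {"word": [...], "link": name}}}.
    """
    tables = {"root": {}}
    for word, tags in lexicon.items():
        table_name = "root"
        prefix = ""
        for i, ch in enumerate(word):
            prefix += ch
            link = prefix + "-"  # name of the table reached after reading ch
            row = tables[table_name].setdefault(ch, {"word": [], "link": link})
            tables.setdefault(link, {})  # linked table exists even if empty
            if i == len(word) - 1:
                row["word"].extend(tags)  # the prefix read so far is an entry
            table_name = link
    return tables


# Sample lexicon; the tag for "rd." is an assumption, not taken from FIG. 1.
tables = build_trie_tables({"ra": ["N", "Prop. N"], "rd.": ["N"]})
for name, rows in tables.items():
    print(name, rows)
```

In this sketch the table "rd.-" comes out empty, which corresponds to the case where no dictionary entry continues beyond "rd.".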
The flowchart in FIG. 2 gives a simple example of an implementation of the basic TRIE method. The recognizing process is described below by referring to FIG. 2 and using the character string "Rd." as an example.
When the process starts, the TRIE table "root" is read (step S1), the leftmost character "R" of the input character string is read, and the input pointer is shifted to the right (step S2). Then, it is checked whether or not the read character is entered as an input character in the TRIE table "root" (step S3). Since the character "R" ("r") is in the TRIE table "root", the corresponding TRIE table "r-" pointed to by the TRIE table link is read (step S4).
Next, the first of the remaining characters, "d", is read (step S2), and the TRIE table "rd-" pointed to by the corresponding TRIE table link in the TRIE table "r-" is read (step S4). Then, the character "." is read (step S2), and the TRIE table "rd.-" pointed to by the corresponding TRIE table link in the TRIE table "rd-" is read (step S4).
Then, a space character " " is read (step S2). Since the TRIE table "rd.-" is empty and has no corresponding entry, it is checked whether or not the character string "Rd." is a dictionary entry (step S5). Since "Rd." (short for "Road") is entered in the dictionary, it is recognized as a word (step S6), thereby terminating the process.
The TRIE table "rd.-" contains no entry because the dictionary has no entry of the word starting with the three-character string "rd." followed by subsequent characters.
If the dictionary contains no corresponding entry in step S5, the characters of the read character string are discarded one at a time, starting from the last character read (steps S7 and S8). When the remaining character string matches one of the entries in the dictionary (Yes in step S5), the character string is recognized as a word (step S6). If all characters are discarded and no corresponding dictionary entry is found (No in step S8), the analysis fails (step S9).
Described below is an example of the process in steps S5 through S8 in which a character string is successfully recognized in step S6 after the last character is discarded in step S7. Assume that the word "catch 22" as well as "catch" is a dictionary entry. Since the TRIE table "catch-" contains " " (space), when the character string "catch the dog" is entered, the space is accepted and the next character "t" is also read (step S2). However, "t" is not found under the " " (space) entry of the TRIE table "catch-" (No in step S3), and the character string "catch " ("catch(space)") is not entered in the dictionary (No in step S5). Accordingly, the last character " " (space) is discarded (step S7), and the character string "catch" is recognized as a word (step S6).
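The flow of steps S1 through S9 can be sketched as follows. This is a simplified illustration under stated assumptions rather than the implementation of FIG. 2: the names `build_trie` and `recognize` are hypothetical, nested dictionaries stand in for the linked TRIE tables, the empty-string key `""` is an assumed marker for "the characters read so far form a dictionary entry", and input is lowercased to mirror the treatment of "R" as "r".

```python
def build_trie(words):
    """Nested dicts standing in for linked TRIE tables."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[""] = True  # assumed marker: this prefix is a dictionary entry
    return root


def recognize(text, root):
    """Return the longest dictionary word at the start of text, or None."""
    path = [root]  # path[k] is the table reached after reading k characters
    pos = 0
    # Steps S2 to S4: read characters while the current table has an entry.
    while pos < len(text) and text[pos].lower() in path[-1]:
        path.append(path[-1][text[pos].lower()])
        pos += 1
    # Steps S5 to S8: discard trailing characters until an entry is found.
    while pos > 0:
        if "" in path[pos]:
            return text[:pos]  # step S6: recognized as a word
        pos -= 1  # step S7: discard the last character
    return None  # step S9: the analysis fails


trie = build_trie(["catch", "catch 22", "Rd."])
print(recognize("catch the dog", trie))
print(recognize("Rd. 5", trie))
```

Keeping the list of visited tables makes the back-off in steps S7 and S8 a matter of stepping back through that list, without re-reading the input.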
Detection of spelling errors and spelling alternatives in such a system can be done in two ways: either by waiting for a mismatch to occur and looking for alternatives from that point, or by assuming that any character could be erroneous (even if it matches) and computing alternative paths all the time.
As an example, consider the words "airborne", "airconditioned", and "airport", and assume that "airport" has been misspelled "airbort". In a morphological analysis system that waits for a mismatch, processing will proceed until the final `t` before it realizes something is wrong. It will then assume the `t` is a misspelling for `n`, and continue processing from there, and it would have to backtrack in order to find that "airport" is also a possible alternative. In this example, "airport" is more likely as there is only one misspelled character (the `b`, which should have been a `p`), while if "airborne" were the correct answer the input word would contain two consecutive misspellings (a `t` in place of the `n`, and the final `e` would have been omitted). Therefore, there is a higher possibility that "airport" is correct.
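The error counts in this example can be checked with a short sketch. The text does not name a particular metric; the standard Levenshtein edit distance (substitutions, insertions, and deletions each counted as one error) is assumed here purely for illustration, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # ca is superfluous in a
                curr[j - 1] + 1,           # cb was omitted from a
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = curr
    return prev[-1]


print(edit_distance("airbort", "airport"))   # 1: `b` in place of `p`
print(edit_distance("airbort", "airborne"))  # 2: `t` for `n`, `e` omitted
```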
In a morphological analysis system that routinely computes alternative paths, the possibility that the `b` is a misspelling (for `p` or `c`, among others) is assumed straight away when the `b` is read in. This system will read on, notice that the `c` is unlikely due to the follow-up characters, hang on to the alternative "airborne" a little longer, but in the end dismiss it in favour of "airport".
However, the above described conventional morphological analysis has the following problems.
Note that in the above analysis there is a large number of other paths that have not been taken into consideration: this was done to keep the example simple and easy to understand.
Note also that if any character can be a misspelling of any other character (and if any character can be superfluous, and if any character of the intended word can have been left out), then any word can be a misspelling of any other word, which is obviously not a desirable situation.
This situation is generally forestalled by checking the analysis paths against one or more criteria, for instance that there be no more than 2 errors in any word, and discontinuing processing for paths that fail to meet them. Possible criteria for checking whether an error is probable include the number of errors in a word, the presence or absence of consecutive spelling errors, (if errors are weighted) the total weight of the spelling errors detected, or any combination of these. However, in the conventional morphological analysis method, a number of alternative paths still remain after these strategies are applied, thereby leaving the problem of low performance.
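One such criterion, a limit on the number of errors per word, can be sketched as follows. This is a hedged illustration and not the method of any particular conventional system: the name `fuzzy_matches`, the nested-dictionary trie, and the `""` end-of-word marker are assumptions. Every character is treated as possibly substituted, superfluous, or omitted, and a path is discontinued as soon as its error count exceeds `max_errors`.

```python
def fuzzy_matches(word, dictionary, max_errors=2):
    """Return {dictionary_word: fewest errors} for words within max_errors."""
    trie = {}
    for entry in dictionary:
        node = trie
        for ch in entry:
            node = node.setdefault(ch, {})
        node[""] = True  # assumed end-of-word marker

    results = {}

    def walk(node, i, errors, prefix):
        if errors > max_errors:
            return  # criterion failed: discontinue this path
        if i == len(word) and "" in node:
            results[prefix] = min(results.get(prefix, errors), errors)
        for ch, child in node.items():
            if ch == "":
                continue
            if i < len(word):
                # Input character matches ch or is a misspelling of ch.
                walk(child, i + 1, errors + (ch != word[i]), prefix + ch)
            # ch was left out of the input.
            walk(child, i, errors + 1, prefix + ch)
        if i < len(word):
            # The input character is superfluous.
            walk(node, i + 1, errors + 1, prefix)

    walk(trie, 0, 0, "")
    return results


words = ["airborne", "airconditioned", "airport"]
print(fuzzy_matches("airbort", words, max_errors=1))
print(fuzzy_matches("airbort", words, max_errors=2))
```

Even with the limit in place, the sketch explores every substitution, omission, and insertion within the error budget at each position, which illustrates why many alternative paths can remain.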