Conventional spell checkers work by looking up each word in the target document in a dictionary. If a word is not found in, or morphologically derivable from, the dictionary, then it is declared a spelling error. For example, consider the sentence: "I would like teh chocolate cake for dessert." A conventional spell checker would notice that "teh" is not in the dictionary, and thus would flag it as an error.
However, a large class of spelling errors are not detectable by conventional spell checkers; namely, errors in which the misspelled word happens to be a valid word in English. Consider, for example, the sentence: "I would like the chocolate cake for desert." In this sentence, the word "dessert" was intended, but "desert" was typed. Because "desert" can be found in the dictionary, this error will go undetected by conventional spell checkers. This type of error will be referred to as a "context-sensitive" spelling error, because the offending word is a correct word when considered in isolation, but it is incorrect in the context of the sentence in which it occurred.
Several methods have been developed for detecting and correcting context-sensitive spelling errors. The method of Mays et al., as described in Eric Mays, Fred J. Damerau, and Robert L. Mercer, Context based spelling correction, Information Processing & Management, 27(5):517-522, 1991, starts by hypothesizing, for a given sentence, the set of possible sentences that the user may have intended to type. It then determines the probability that each such sentence was in fact the one that was intended. It selects as its answer the sentence with the highest probability of being intended. For example, suppose the method is given the sentence above, "I would like the chocolate cake for desert.", as the target sentence to correct. It generates a large number of possible intended sentences by inserting up to one typo in each word of the given sentence. Its resulting set of possible sentences includes, among others: "A would like the chocolate cake for desert."; "I could pike the chocolate cake far desert."; "I would like the chocolate cake for dessert."; and "I would like the chocolate cake for desert.". Note that the last sentence is the same as the original sentence, and thus represents the possibility that the original sentence, as typed, was the one that was intended.
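The candidate-generation step described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of Mays et al. as published: the toy DICTIONARY, the helper names, and the restriction of a "typo" to a single insertion, deletion, or substitution are all hypothetical, and words are lowercased for simplicity.

```python
from itertools import product

# Hypothetical toy dictionary; a real system would use a full lexicon.
DICTIONARY = {"a", "i", "would", "could", "like", "pike", "the",
              "chocolate", "cake", "for", "far", "desert", "dessert"}

def within_one_edit(a, b):
    """True if b equals a or differs by one insertion, deletion, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # exactly one substituted character
    return a[i:] == b[i + 1:]  # exactly one extra character in b

def variants(word):
    """Dictionary words reachable from `word` with at most one typo."""
    return sorted(w for w in DICTIONARY if within_one_edit(word, w))

def candidate_sentences(words):
    """Every sentence formed by choosing one variant for each word."""
    return [list(choice) for choice in product(*(variants(w) for w in words))]

typed = "i would like the chocolate cake for desert".split()
candidates = candidate_sentences(typed)
```

The candidate set always contains the sentence as typed, because each word is a variant of itself. A scoring step would then select the candidate with the highest probability of being intended.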
Determining the probability that each candidate sentence was the one that was intended involves calculating the a priori probability of each sentence; that is, the probability of that sentence appearing as a sentence in English. These a priori probabilities are calculated using a word trigram model. The model estimates the a priori probability of a sentence in terms of the probability of each consecutive 3-word sequence, or word trigram, in the sentence. For instance, for the sentence above, "I would like the chocolate cake for desert.", the word trigrams would be: (_, _, "I"); (_, "I", "would"); ("I", "would", "like"); ("would", "like", "the"); ("like", "the", "chocolate"); ("the", "chocolate", "cake"); ("chocolate", "cake", "for"); ("cake", "for", "desert"); ("for", "desert", "."); ("desert", ".", _); and (".", _, _), where the underscore marks a padding position before the start or after the end of the sentence. The probability of a word trigram (w1, w2, w3) is the probability that, given that words w1 and w2 occur consecutively in a sentence, the next word in the sentence will be w3. For instance, the probability of the word trigram ("the", "chocolate", "cake") is the probability of seeing the word "cake" after the word sequence "the chocolate".
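A minimal sketch of such a trigram model is given below. It assumes a padding symbol written as "_", estimates each trigram probability as a simple relative frequency with no smoothing, and sums log probabilities over the trigrams of a sentence; the function names and the tiny training corpus used with it are hypothetical.

```python
import math
from collections import Counter

PAD = "_"  # assumed padding symbol for positions outside the sentence

def trigrams(words):
    """All consecutive 3-word sequences, with two pads at each end."""
    padded = [PAD, PAD] + list(words) + [PAD, PAD]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

def train(corpus):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    tri, ctx = Counter(), Counter()
    for sentence in corpus:
        for w1, w2, w3 in trigrams(sentence):
            tri[(w1, w2, w3)] += 1
            ctx[(w1, w2)] += 1
    return tri, ctx

def log_prob(sentence, tri, ctx):
    """Sum of log P(w3 | w1, w2); -inf if any trigram was never seen."""
    total = 0.0
    for w1, w2, w3 in trigrams(sentence):
        if tri[(w1, w2, w3)] == 0:
            return float("-inf")
        total += math.log(tri[(w1, w2, w3)] / ctx[(w1, w2)])
    return total

typed = "I would like the chocolate cake for desert .".split()
intended = "I would like the chocolate cake for dessert .".split()
tri, ctx = train([intended, intended, "he crossed the desert .".split()])
best = max([typed, intended], key=lambda s: log_prob(s, tri, ctx))
```

Because the estimate is unsmoothed, any sentence containing a trigram absent from the training data receives zero probability, so the quality of the ranking depends directly on how much training text is available.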
The method of Mays et al. needs an enormous corpus of training sentences in order to learn these trigram probabilities. To measure each trigram probability reliably, it needs enough sentences to have seen every triple of words that can occur in the English language a statistically significant number of times. The difficulty of obtaining and processing such a huge training corpus is known as a sparse data problem. This problem has led others to develop alternative methods of context-sensitive spelling correction.
Schabes et al., in U.S. patent application Ser. No. 08/252,572, filed Jun. 1, 1994 by Yves Schabes, Emmanuel Roche, and Andrew R. Golding, entitled "System for correcting grammar based on part-of-speech probabilities", incorporated herein by reference, developed a method that is related to that of Mays et al. However, Schabes et al. use part-of-speech trigrams, rather than word trigrams. For instance, while Mays et al. would use the word trigram ("the", "chocolate", "cake"), Schabes et al. would use the corresponding part-of-speech trigram (ARTICLE, ADJ, NOUN).
Instead of needing sentences illustrating every triple of words that can occur in English, Schabes et al. only need illustrations of every triple of parts of speech, i.e., VERB, ARTICLE, NOUN, etc. This drastically reduces the size of the training corpus that is needed, thereby solving the aforementioned sparse-data problem.
The method of Schabes et al. introduces a new problem, however. Because it analyzes sentences in terms of their part-of-speech sequences, it has trouble with errors in which the offending word has the same part of speech as the intended word. For example, consider again the two sentences: "I would like the chocolate cake for dessert." and "I would like the chocolate cake for desert.". Schabes et al. analyze these sentences in terms of their part-of-speech sequences, namely: PRONOUN MODAL VERB ARTICLE ADJ NOUN PREP NOUN PUNC and PRONOUN MODAL VERB ARTICLE ADJ NOUN PREP NOUN PUNC. Here the intended word, "dessert", and the offending word, "desert", have the same part of speech, i.e., NOUN. Moreover, the entire part-of-speech sequence is the same for the two sentences. Thus the two sentences are essentially indistinguishable to the method of Schabes et al., which analyzes the sentences at the level of their part-of-speech sequences. In general, the method of Schabes et al. is ineffective at correcting context-sensitive spelling errors whenever the offending word and the intended word have the same part of speech.
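The limitation can be seen in a small sketch. The tag table below is hand-written from the part-of-speech sequences given above and stands in for a real tagger; the table and the function names are hypothetical.

```python
# Hypothetical hand-assigned tags taken from the sequences in the text;
# a real system would obtain them from a part-of-speech tagger.
TAGS = {
    "I": "PRONOUN", "would": "MODAL", "like": "VERB", "the": "ARTICLE",
    "chocolate": "ADJ", "cake": "NOUN", "for": "PREP",
    "dessert": "NOUN", "desert": "NOUN", ".": "PUNC",
}

def pos_sequence(words):
    """Map each word to its part-of-speech tag."""
    return [TAGS[w] for w in words]

def pos_trigrams(words):
    """Consecutive tag triples; padding is omitted for brevity."""
    tags = pos_sequence(words)
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

intended = "I would like the chocolate cake for dessert .".split()
typed = "I would like the chocolate cake for desert .".split()

# The two sentences differ as word sequences but not as tag sequences,
# so any score computed from tag trigrams alone is identical for both.
same_tags = pos_trigrams(intended) == pos_trigrams(typed)
```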
A third method for context-sensitive spelling correction was developed by Yarowsky and is presented in David Yarowsky, A comparison of corpus-based techniques for restoring accents in Spanish and French text, in Proceedings of the Second Annual Workshop on Very Large Corpora, Kyoto, Japan, 1994. Yarowsky's method uses neither word trigrams nor part-of-speech trigrams, and is thus immune from both problems mentioned earlier, i.e., sparse data, and the inability to discriminate among words with the same part of speech. Yarowsky applied his method not to the task of context-sensitive spelling correction, but to the related task of accent restoration in Spanish and French. This task is to take a word that has been stripped of any accent, such as "terminara" in Spanish, and to decide whether the intended word is the accented version "terminará" or the unaccented version "terminara". Note that this is a special case of context-sensitive spelling correction, in which the spelling errors always take the form of accent deletion.
To decide which was the intended spelling of the word, e.g., "terminará" or "terminara", Yarowsky's method analyzes the context in which the word occurred. In particular, it tests two kinds of features of the context: context-word features, and collocation features. A context-word feature is the presence of a particular word within ±k words of the target word. For instance, suppose Yarowsky's method is used to decide which word was intended, "desert" or "dessert", in the sentence: "I would like the chocolate cake for desert.". One possible context-word feature would be the presence of the word "chocolate" within ±20 words of "desert". The presence of "chocolate" would tend to suggest that "dessert" was intended. On the other hand, a different context-word feature, the presence of the word "sand" within ±20 words, would tend to suggest that "desert" was intended.
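A context-word test of this kind can be sketched as a window check of k words on either side of the target. The function name and its signature are hypothetical and are not taken from Yarowsky's paper.

```python
def has_context_word(words, target_index, context_word, k=20):
    """True if context_word occurs within k words on either side of the target.

    The target word itself is excluded from the window.
    """
    lo = max(0, target_index - k)
    hi = min(len(words), target_index + k + 1)
    window = words[lo:target_index] + words[target_index + 1:hi]
    return context_word in window

sentence = "I would like the chocolate cake for desert .".split()
target = sentence.index("desert")

suggests_dessert = has_context_word(sentence, target, "chocolate")  # evidence for "dessert"
suggests_desert = has_context_word(sentence, target, "sand")        # evidence for "desert"
```

The window size k is a tuning choice: a narrow window misses topical cues that lie further from the target word, while a very wide one admits words with little bearing on it.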
The second type of feature used by Yarowsky's method is collocation features. A collocation feature is the presence of a particular pattern of words and/or part-of-speech tags around the target word. For example, the pattern "for _" specifies that the word "for" occurs directly before the target word, whose position is symbolized by an underscore. The presence of this pattern would tend to suggest that "dessert" was intended, as in the sentence above. On the other hand, the pattern "PREP the _" would tend to suggest that "desert" was intended, as in: "He wandered aimlessly in the desert.".
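Collocation matching can be sketched as aligning a pattern with the words around the target, with the target position written as an underscore. The sketch assumes that words are lowercased and tags are uppercase, so that a pattern element can be compared against both without ambiguity; the hand-assigned tags and the function name are hypothetical.

```python
def matches_collocation(words, tags, target_index, pattern):
    """True if `pattern` aligns with the context around the target word.

    `pattern` is a list in which "_" marks the target position and every
    other element is either a literal word or a part-of-speech tag.
    """
    slot = pattern.index("_")
    start = target_index - slot
    if start < 0 or start + len(pattern) > len(words):
        return False  # the pattern would run off the sentence
    for offset, element in enumerate(pattern):
        if element == "_":
            continue
        pos = start + offset
        if element != words[pos] and element != tags[pos]:
            return False
    return True

# Hand-assigned tags (an assumption; a real system would run a tagger).
s1 = "i would like the chocolate cake for desert .".split()
t1 = "PRONOUN MODAL VERB ARTICLE ADJ NOUN PREP NOUN PUNC".split()
s2 = "he wandered aimlessly in the desert .".split()
t2 = "PRONOUN VERB ADV PREP ARTICLE NOUN PUNC".split()

for_hit = matches_collocation(s1, t1, s1.index("desert"), ["for", "_"])
prep_the_hit = matches_collocation(s2, t2, s2.index("desert"), ["PREP", "the", "_"])
```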
Yarowsky's method combines these two types of features, context words and collocations, via the method of decision lists. A decision list is an ordered list of features that are used to make a decision in favor of one option or another. The features are ordered such that the most reliable discriminators appear first in the list. For example, suppose Yarowsky's method is used to decide which word was intended, "desert" or "dessert". It might use the following decision list: (1) "for _" → "dessert"; (2) "PREP the _" → "desert"; (3) "chocolate" within ±20 → "dessert"; (4) "sand" within ±20 → "desert". This decision list is used by testing whether each feature in the list in turn matches the target context. The first feature that matches is used to make a decision about the intended spelling of the target word.
Consider, for example, the application of this procedure to the sentence: "I would like the chocolate cake for desert.". The method first tests whether feature (1) matches the context around the target word "desert". This involves checking for the presence of the word "for" before "desert". The test succeeds, and so the method suggests that the target word should be changed to "dessert".
Now consider the application of the method to "desert" in the sentence: "He wandered aimlessly in the desert.". The method tries to match feature (1), but fails, because the word "for" is not found before "desert". It tries to match feature (2), which succeeds, since "in" is a PREP and the word "the" appears before "desert". Because feature (2) suggests "desert", the method accepts the given sentence as correct.
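The two walkthroughs above can be reproduced with a short sketch of decision-list application. The four features and their order follow the example list given earlier; everything else is an assumption made for illustration, including the helper names, the lowercased words, the hand-assigned tags, and the choice to keep the typed word when no feature matches.

```python
def context_word(word, k):
    """Feature: `word` occurs within k words on either side of the target."""
    def test(words, tags, i):
        return word in words[max(0, i - k):i] + words[i + 1:i + k + 1]
    return test

def collocation(*pattern):
    """Feature: words or tags in `pattern` align around the target slot "_"."""
    slot = pattern.index("_")
    def test(words, tags, i):
        start = i - slot
        if start < 0 or start + len(pattern) > len(words):
            return False
        return all(p == "_" or p in (words[start + j], tags[start + j])
                   for j, p in enumerate(pattern))
    return test

# Ordered from most to least reliable, as in the example decision list.
DECISION_LIST = [
    (collocation("for", "_"), "dessert"),
    (collocation("PREP", "the", "_"), "desert"),
    (context_word("chocolate", 20), "dessert"),
    (context_word("sand", 20), "desert"),
]

def decide(words, tags, i):
    """Return the spelling chosen by the first matching feature."""
    for test, decision in DECISION_LIST:
        if test(words, tags, i):
            return decision
    return words[i]  # assumption: keep the typed word if nothing matches

s1 = "i would like the chocolate cake for desert .".split()
t1 = "PRONOUN MODAL VERB ARTICLE ADJ NOUN PREP NOUN PUNC".split()
s2 = "he wandered aimlessly in the desert .".split()
t2 = "PRONOUN VERB ADV PREP ARTICLE NOUN PUNC".split()

choice1 = decide(s1, t1, s1.index("desert"))  # feature (1) fires first
choice2 = decide(s2, t2, s2.index("desert"))  # feature (2) fires first
```

Only the first matching feature contributes to the result. In the first sentence, for example, features (1) and (3) both match, but feature (3) is never consulted.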
Yarowsky's method uses decision lists to take advantage of two types of knowledge: context words and collocations. For any given target problem, it applies the single strongest piece of evidence, whichever type that happens to be. This is implemented by applying the first feature that matches, where "first" corresponds to "strongest" because the features have been sorted in order of decreasing reliability. While this is a powerful approach, its drawback is that it only brings a single piece of evidence to bear on any one decision. This is disadvantageous when either the method is mistaken in its evaluation of which piece of evidence is strongest, or the strongest piece of evidence is outweighed by several weaker pieces of evidence that together suggest an alternative decision. What is necessary is a new method for context-sensitive spelling correction that bases its decisions not on the single strongest piece of evidence, but on all of the available evidence, thereby avoiding the abovementioned disadvantages of decision lists.