Spell checkers are well-known components of computer programs that inform users that a word is misspelled and, in some cases, correct the error to the appropriate spelling. Word processing programs, email programs, spreadsheets, browsers, and the like are examples of computer programs that employ spell checkers.
One conventional type of spell checker corrects errors in an ad hoc fashion: a designer manually specifies the types of allowable edits and the weight associated with each edit type. For example, for the spell checker to recognize the entry error “fysical” and correct it to the intended word “physical”, the designer manually specifies a substitution edit type that allows the letters “ph” to be substituted for the letter “f”. Because it is built manually, this approach does not readily port to a new language or adapt to an individual's typing style.
Another type of spell checker is one that learns errors and weights automatically, rather than being manually configured. One type of trainable spell checker is based on a noisy channel model, which observes character strings actually entered by a user and attempts to determine the intended string based on a model of generation.
Spell checkers based on the noisy channel model have two components: (1) a word or source generation model, and (2) a channel or error model. The source model describes how likely a particular word is to have been generated. The error model describes how likely a person intending to input X will instead input Y. Together, the two models describe how likely a particular word is to be the intended word, given an observed string that was entered.
As an example, suppose a user intends to type the word “physical”, but instead types “fysical”. The source model evaluates how likely the user is to have intended the word “physical”. The error model evaluates how likely the user is to type in the erroneous word “fysical” when the intended word is “physical”.
The classic error model computes the Levenshtein Distance between two strings, which is the minimum number of single letter insertions, deletions, and substitutions needed to transform one character string into another. The classic error model is described in Levenshtein, V. “Binary Codes Capable of Correcting Deletions, Insertions and Reversals.” Soviet Physics—Doklady 10, 10, pp. 707–710. 1966.
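The distance described above can be computed with a standard dynamic-programming table. The following Python function is a minimal sketch for illustration only; the function name `levenshtein` and its structure are assumptions of this example and are not taken from the cited reference.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-letter insertions, deletions, and
    substitutions needed to transform source into target."""
    # previous[j] is the distance between the letters of source processed
    # so far (initially none) and the first j letters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        # Turning the first i letters of source into "" takes i deletions.
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# "fysical" becomes "physical" by substituting "p" for "f" and inserting "h".
print(levenshtein("fysical", "physical"))
```

Under this unweighted measure every edit costs the same, so the two edits separating “fysical” from “physical” count no differently from any other two edits.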
A modification of the classic error model employs a Weighted Levenshtein Distance, in which each edit operation is assigned a different weight. For instance, the weight assigned to the operation “Substitute e for i” is significantly different than the weight assigned to the operation “Substitute e for M”. Essentially all existing spell checkers that are based on edit operations use the weighted Levenshtein Distance as the error model, while sometimes adding a small number of additional edit templates, such as transposition, doubling, and halving.
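A weighted variant can be sketched by replacing the unit costs with per-operation cost functions. In the Python example below, the function names and the specific weights (0.5 for a vowel-for-vowel substitution, 1.0 otherwise) are hypothetical values chosen only to illustrate that “Substitute e for i” may be weighted differently from “Substitute e for M”; they are not weights from any existing spell checker.

```python
from typing import Callable

VOWELS = set("aeiou")


def example_sub_cost(typed: str, intended: str) -> float:
    """Hypothetical substitution weights, for illustration only."""
    if typed == intended:
        return 0.0
    if typed in VOWELS and intended in VOWELS:
        return 0.5  # assumed: confusing two vowels is a cheaper edit
    return 1.0


def weighted_levenshtein(
    source: str,
    target: str,
    sub_cost: Callable[[str, str], float] = example_sub_cost,
    ins_cost: Callable[[str], float] = lambda c: 1.0,
    del_cost: Callable[[str], float] = lambda c: 1.0,
) -> float:
    """Minimum total weight of edits that transform source into target."""
    # First row: build each prefix of target from "" by insertions alone.
    previous = [0.0]
    for t_char in target:
        previous.append(previous[-1] + ins_cost(t_char))
    for s_char in source:
        # First column: delete every letter of source seen so far.
        current = [previous[0] + del_cost(s_char)]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j - 1] + sub_cost(s_char, t_char),
                previous[j] + del_cost(s_char),
                current[j - 1] + ins_cost(t_char),
            ))
        previous = current
    return previous[-1]


print(weighted_levenshtein("bit", "bet"))  # vowel-for-vowel substitution
print(weighted_levenshtein("bMt", "bet"))  # substitution involving "M"
```

With these assumed weights the first pair is closer than the second, even though both differ by exactly one substitution.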
The error model can be implemented in several ways. One way is to assume all edits are equally likely. In an article by Mays, E., Damerau, F., and Mercer, R. entitled “Context Based Spelling Correction,” Information Processing and Management, Vol. 27, No. 5, pp. 517–522, 1991, the authors describe pre-computing a set of edit-neighbors for every word in the dictionary. A word is an edit-neighbor of another word if it can be derived from the other word by a single edit, where an edit is defined as a single letter insertion (e.g., Ø→a), a single letter substitution (e.g., a→b), a single letter deletion (e.g., a→Ø), or a letter-pair transposition (e.g., ab→ba). For every word in a document, the spell checker determines whether any edit-neighbor of that word is more likely to appear in that context than the word that was typed. All edit-neighbors of a word are assigned equal probability of having been the intended word, and the context is used to determine which word to select. It is noted that the word itself (if it is in the dictionary) is considered an edit-neighbor of itself, and it is given a much higher probability of being the intended word than the other edit-neighbors.
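The edit-neighbor definition above can be illustrated with a short Python sketch. It departs from the cited article in one respect, which is an assumption of this example: rather than pre-computing neighbors for every dictionary word, it generates the single-edit variants of one typed word on demand and keeps those found in the dictionary. The function name, the lowercase alphabet, and the toy dictionary are all hypothetical.

```python
import string


def edit_neighbors(word: str, dictionary: set,
                   alphabet: str = string.ascii_lowercase) -> set:
    """Dictionary words reachable from word by at most one edit."""
    candidates = {word}  # a word is considered an edit-neighbor of itself
    for i in range(len(word) + 1):
        left, right = word[:i], word[i:]
        for letter in alphabet:
            candidates.add(left + letter + right)  # insertion
        if right:
            candidates.add(left + right[1:])  # deletion
            for letter in alphabet:
                candidates.add(left + letter + right[1:])  # substitution
        if len(right) > 1:
            # letter-pair transposition
            candidates.add(left + right[1] + right[0] + right[2:])
    return candidates & dictionary


toy_dictionary = {"form", "from", "farm", "for", "forms", "physical"}
print(sorted(edit_neighbors("form", toy_dictionary)))
```

In the scheme described above, the words returned here would then be compared by how likely each is to appear in the surrounding context.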
A second way to implement the error model is to estimate the probabilities of various edits from training data. In an article by Church, K. and Gale, W., entitled “Probability Scoring for Spelling Correction,” Statistics and Computing 1, pp. 93–103, 1991, the authors propose employing the identical set of edit types used by Mays et al. (i.e., single letter insertion, substitution, deletion, and letter-pair transposition) and automatically deriving probabilities for all edits by computing the probability of an intended word w given an entered string s. The Church et al. method trains on a training corpus to learn the probabilities for each possible change, regardless of the correct word and entered word. In other words, it learns the probability that an erroneous input string s will be written when the correct word w was intended, or P(s|w). The Church et al. method improves the modeling of insertions and deletions by conditioning them on one character of context.
The error model probability P(s|w) used in noisy channel spell correction programs, such as the one described in Church et al., may seem backwards initially because it suggests finding how likely a string s is to be entered given that a dictionary word w is intended. In contrast, the spell correction program actually wants to know how likely the entered string s is to be a word w in the dictionary, or P(w|s). The error model probability P(s|w) comes from Bayes' formula, which can be used to represent the desired probability P(w|s) as follows:
$$P(w \mid s) = \frac{P(s \mid w) \cdot P(w)}{P(s)}$$
The denominator P(s) remains the same for purposes of comparing possible intended words given the entered string. Accordingly, the spell checking analysis concerns only the numerator product P(s|w) P(w), where the probability P(s|w) represents the error model and the probability P(w) represents the source model.
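Because P(s) is constant across candidates, selecting the intended word amounts to maximizing the product P(s|w) P(w). The Python sketch below illustrates that comparison for the “fysical” example. The function name `best_correction`, the candidate list, and every probability value are made up for illustration; they are not estimates from any corpus or from the cited articles.

```python
from typing import Callable, Iterable


def best_correction(
    entered: str,
    candidates: Iterable[str],
    source_prob: Callable[[str], float],
    error_prob: Callable[[str, str], float],
) -> str:
    """Return the candidate word w that maximizes P(s|w) * P(w)."""
    def score(word: str) -> float:
        # Numerator of Bayes' formula; P(s) is omitted because it is the
        # same for every candidate and does not change the ranking.
        return error_prob(entered, word) * source_prob(word)
    return max(candidates, key=score)


# Hypothetical source model P(w): made-up word probabilities.
toy_source = {"physical": 1e-4, "fiscal": 5e-5}
# Hypothetical error model P(s|w), keyed by (entered string, intended word).
toy_error = {("fysical", "physical"): 1e-3, ("fysical", "fiscal"): 1e-5}

result = best_correction(
    "fysical",
    toy_source.keys(),
    source_prob=lambda w: toy_source[w],
    error_prob=lambda s, w: toy_error[(s, w)],
)
print(result)
```

With these made-up values the product for “physical” exceeds the product for “fiscal”, so “physical” is selected.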
As application programs become more sophisticated and the needs of users evolve, there is an ongoing need to improve spell checkers. The inventors have developed an improved spell checker that is based on the noisy channel model, which incorporates a more powerful error model component.