A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
Although the availability of interactive spelling checkers is widespread, users do not like to use such systems because they are tedious. Interactive spelling checkers ask the user about any word that does not appear in the dictionary, even though most such words are valid. Such dictionary-based systems also do not detect valid word errors where the user accidentally substitutes one word for another. Even when the interactive systems do catch the errors (e.g., when the error yields a word that is not found in the dictionary), the first-guess accuracy is low, forcing the user to select the correct word from among a list of candidate alternatives. If the systems were to select the top-ranked candidate correction for automatic substitution, the low first-guess accuracy would mean that more than half of the automatic substitutions would be incorrect. Because of the extra effort involved and the tedious nature of the user interfaces, many users decide not to use interactive spelling checkers.
The present invention addresses these problems with known interactive spelling checkers. Since it has near-perfect first-guess accuracy, it can automatically correct errors as the user types without introducing new errors. It shifts the emphasis from recognizing valid words to recognizing errors. Identifying the nature of the error often allows correction of the error, even if there is no similar word in the valid word dictionary. Although there are existing systems based on dictionaries of common spelling errors and their associated corrections, these systems are limited to recognizing only the errors explicitly listed in the dictionary. The typical error dictionary contains about a thousand of the most common errors. The present invention presents a rule-based method for detecting and correcting spelling and grammar errors. The invention is not guaranteed to catch all errors, but those that it does correct are extremely likely to be genuine spelling and grammar errors. A variation of this invention for handwriting recognition and optical character recognition (OCR) improves the recognition accuracy of such systems.
A "regular expression" is a computer programming construct that comprises an n-gram template to be matched against a string of characters in a word. The n-gram template string may comprise less than all characters in the word. Matching the string either succeeds or fails. A matched pattern may cause addition, deletion, transposition and/or substitution of characters in the word. The n-gram template may comprise alternative characters, wild card characters and position indicators.
Briefly, according to one embodiment of this invention, there is provided a computer implemented method which does not require a stored dictionary of valid words for correcting spelling errors in a sequence of words. The method comprises the steps of storing a plurality of spelling rules defined as regular expressions for matching a potentially illegal n-gram which may comprise less than all letters in the word and for replacing an illegal n-gram with a legal n-gram to return a corrected word. A word from the sequence of words is submitted to the spelling rules. If a corrected word is returned, it is substituted for the misspelled word in the sequence of words. The method may comprise submitting a corrected word to at least one additional rule.
According to another embodiment of this invention, there is provided a method of correcting both spelling errors and grammar errors. The method comprises storing a plurality of spelling and grammar rules defined as regular expressions given the context of one or more adjacent words. At least two adjacent words at a time from the sequence of words are submitted to the rules. If a corrected word or sequence of corrected words is returned, it is substituted in the sequence of words.
Preferably, an exception list is associated with each regular expression or with the system as a whole to prevent n-gram replacement where the word matches an exception to the rule. Preferably, the spelling rules match potentially illegal n-grams comprising two or more characters. More preferably, the spelling rules recognize and correct complex types of errors in addition to simple insertions, deletions, substitutions and transpositions.
Applications of the methods disclosed herein include word processing programs that automatically correct errors as the user types, word processing programs with batch spelling correction, optical character reader programs and automatic handwriting recognition programs.
Most preferably, the methods according to this invention include storing spelling rules using multiple words in context to identify spelling errors, confusable words and common grammar errors to identify a unique correction from more than one possible correction or word boundary errors comprising missing spaces, inserted spaces, shifted spaces and combinations thereof.
According to a preferred embodiment, the stored rules include constraints based on case restrictions, parts of speech, capitalization and/or punctuation appearing within the sequence of words.
The methods according to this invention may also include a step for generating potential spelling rules defined as regular expressions, comprising selecting as templates letters from errors in an error corpus and zero or more letters of context to identify a set of potential rules, and pruning from the set of potential rules those that are too general, too specific or do not identify the cause of the error. New rules may be generated based upon the user's manual corrections.
A further embodiment of this invention comprises a word completion method that is context sensitive comprising the steps of storing a plurality of word completion rules defined as regular expressions for matching an n-gram which may comprise less than all letters in the word and for replacing a matched n-gram with an n-gram to complete the word given the context of one or more preceding words. The previous word and n-gram comprising the initial letters of a word being typed are submitted to the rules. If a rule is fired, the word being typed is completed automatically.
The present invention goes beyond the state of the art by recognizing more than just isolated whole-word errors. It uses rules that recognize error patterns and their associated corrections. An error dictionary that contains only whole words can correct only as many errors as are listed in the dictionary. The rules used by the present invention can each correct numerous common errors without reference to a valid word dictionary. In essence, the present invention is not just recognizing the error, but also recognizing the cause of the error. This yields much more productive rules and, hence, a more powerful system.
The rules used by this invention are implemented by use of regular expressions, case-restriction flags, space deletion, insertion and shifting, and multiple words of context (including not just whole words and parts of speech, but also regular expressions). This allows the system to correct errors in a context-sensitive fashion, correct word-boundary errors and correct many valid word errors. The present invention can also correct many grammatical and lexical choice errors.
Regular expressions used by this invention include not just sequences of alphanumeric characters and start-word and end-word flags, but also more abstract patterns, such as left and right handedness of the letters, sets of letters, and the letter that corresponds to toggling another letter's shift bit. The regular expressions are not limited to just the letters involved in the error, but can optionally include multiple letters of context on either or both sides of the error. The regular expressions are constructed to contain just enough context to uniquely identify the nature of the error and hence the corresponding correction. This means that the rules generalize beyond the specific examples that motivated the rule, but are not so general as to introduce new errors into correctly spelled text. It also means that the rules are not limited to single insertions, substitutions, deletions and transpositions, but can also handle other types of errors. They can handle transpositions of letters around one or more letters, such as the transposition of consonants around one or more vowels or the transposition of vowels around one or more consonants. The regular expressions are not limited to bigrams or trigrams, but can be n-grams of any length. The determining factor is the length needed to uniquely identify the correction, not blind selection of all n-grams of a specific length.
The rules used by this invention are bidirectional. Normally, the only use for bidirectional rules would be to randomly introduce natural-seeming errors into correct text. However, the bidirectional rules are useful for "correcting" between British English and American English without requiring a separate set of rules for each direction. If the user specifies that he/she is writing British English, the system simply runs the rules that correct British English to American English in reverse.
Rule-chaining allows multiple errors to be corrected by multiple rules, as well as more complex spelling conventions to be represented by several rules.
The combination of multiple constraints improves the quality of the system. For example, f/v replacement would normally replace the word "knife" with the word "knives" when adding the suffix "s". But when "knife" is used as a verb, the word "knifes" is acceptable. Thus, whether the rule identifying "ifes" as an error should apply depends on the imputed part of speech of the affected word.
The rules used by this invention may include lists of exceptions which may themselves be regular expressions in addition to whole words. This often yields a significant reduction in the number of rules. It also makes it easier for the user to override the operation of the system for particular words.
In the following examples of rules, the $ character signifies end of word and the ^ character signifies start of word. Any exceptions are listed after the rule in parentheses, delimited by commas. Square brackets indicate that any of the enclosed characters can appear in the given position, conflating what would otherwise be several rules.
mnet$ → ment
fuly$ → fully
^ht → th (html, http)
ierd → eird
eif$ → ief
the another → the other
corect → correct
its a → it's a
^a$ ^[aeio] → an (^a$, ^one$, ^one-)
away form → away from
at there → at their
of of → of
their seem → there seem
Note that the "mnet" rule is restricted to words whose last four letters are "mnet", whereas the "ierd" rule can include words in which "ierd" appears in the middle, such as "wierdly". Even the rule involving the misspelled word "corect" is general: it not only covers the pair mapping "corect" to "correct", but also matches and corrects many more spelling errors, such as "corectly", "corected", "corection" and so on. If one wanted to restrict this rule to matching only whole words, one would specify the constraint as "^corect$". Also note the "of of" rule, which corrects a common example of repeated words. Other spelling checkers flag any example of repeated words, even though "nine one one" is not an error. The purpose of these rules is to include only errors that are certain to be incorrect, not to flag all possible errors.
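The single-word rules listed above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation described in this disclosure: the `Rule` class and `correct_word` function are hypothetical names, exceptions are assumed to be whole-word regular expressions, input is assumed to be lower case, and the multi-word rules are omitted. Applying the rules in sequence gives a simple form of rule-chaining, since each rule sees the output of the previous one.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    pattern: str            # regular expression for the illegal n-gram
    replacement: str        # legal n-gram substituted for it
    exceptions: tuple = ()  # regular expressions that block the rule (assumed format)

    def apply(self, word):
        # Leave the word untouched if it matches any exception.
        if any(re.search(exc, word) for exc in self.exceptions):
            return word
        return re.sub(self.pattern, self.replacement, word)


# Single-word rules taken from the list above; "^" and "$" mark word start and end.
RULES = [
    Rule(r"mnet$", "ment"),
    Rule(r"fuly$", "fully"),
    Rule(r"^ht", "th", exceptions=(r"^html$", r"^http$")),
    Rule(r"ierd", "eird"),
    Rule(r"eif$", "ief"),
    Rule(r"corect", "correct"),
]


def correct_word(word, rules=RULES):
    # Each rule receives the output of the previous rule (rule-chaining).
    for rule in rules:
        word = rule.apply(word)
    return word
```

Under these assumptions, `correct_word("wierdly")` yields "weirdly" and `correct_word("html")` is left unchanged because of the exception list.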
The present invention does not correct all errors since some errors do not unambiguously specify their correction, even given context information. In such cases, rules may generate multiple candidate substitutions and allow the user to choose from among the candidate corrections. In any event, the present invention can be used in combination with traditional interactive spelling correction systems. One way is in parallel. The other way is where the correction proposed by the present invention is listed first in the set of candidate corrections proposed by the interactive correction system. If the user should choose not to use the interactive spelling correction system, the automatic spelling correction system will at least have improved the quality of their writing somewhat. Given the realities of user boredom and the tedious nature of batch spelling correction systems, automatic spelling correction will improve spelling accuracy.
According to another embodiment, the present error correction method can learn from the user's own corrections. When the system detects the use of deletion or transposition or insertion followed by or preceded by cursor movement, it records the word before the correction as well as the result of the user's correction. In cases of multiple insertions, deletions and transpositions, it waits until cursor movement moves outside the word to initiate learning. If the error resulted from the action of the automatic correction system (i.e., the user undid the effects of the automatic correction), the system adds the word to an exception list for the rules that generated the error. When the exception list for a rule grows too large, it triggers the rule induction system to refine the rule. If the user did not undo a correction, the system applies the rule induction system to generate a new rule to address the error and similar errors in the future. Thus, the system can adapt to the user's own typing habits.
According to yet another embodiment, the present method may also learn from the user's behavior in using the interactive correction system. If the user made the same error multiple times and always chose the same correction for the error, the system may be configured to ask the user whether it can add the error-correction pair to the automatic correction system. If the user agrees, this will trigger the rule induction system.
A key to the effectiveness of the present invention is how the rules are produced. A large collection of spelling and typing errors made by real people in a natural setting has been gathered. The initial set of rules was then written by hand, often inspired by specific examples from the error corpus. The rules were tested in various ways before being added to the code. For example, a rule was run on an 80,000 word dictionary to verify that it does not introduce errors into valid words. If there are any exceptions, they must be added to the rule or the rule discarded.
New rules, however, may be generated automatically by one of two methods. The first method tries to find the rule that maximally matches the error corpus while minimizing the number of exceptions. The second method is somewhat more cautious in the generalizations it accepts, requiring rules to be statistically representative of the error corpus from a generative perspective. This means that applying the inverse of the rule to the dictionary should yield spelling errors with a similar distribution to that of the corpus. For example, the first method generated the rule
atii → ati
to account for errors like "inspiratiion" and "generatiive". All of the errors in the error corpus that match "atii" end in "ation" or "ative". Applying the inverse of this rule to the dictionary, however, one finds that only half of the errors generated by the inverse rule end in "ation" or "ative". This suggests that although the rule matches all of the errors, it generalizes beyond the cause of the spelling error. One needs to add additional context characters to the rule in order to limit it to just the cases that reflect the nature of the error. Caution is needed in developing rules for an automatic correction system because no dictionary can be complete. For example, most dictionaries do not include personal and family names. The present invention is able to correct spelling errors in names without introducing any new errors. It is desired to minimize the likelihood of a rule causing an error while still maximizing the number of errors it can correct. In an interactive correction system where one wants to identify possible errors without 100% first-guess accuracy, the first of the two methods is to be preferred because of the greater generality of the rules it generates.
In the first rule-design method, each error from the error corpus generates many potential rules by including zero or more characters on either side of the point of the error. Each time a character is added on the left-hand side of the rule, the corresponding character is added to the right-hand side of the rule. For the purpose of rule generation, rules are thought of as simply a multiple-character substitution pair. This encompasses all major types of spelling errors, including insertions, deletions, transpositions, transpositions around a character and, of course, substitutions. For example, the transposition "ie" becoming "ei" after "c" can be represented as the multiple-character substitution "cie" → "cei". Similarly, the deletion of "e" in "geing" can be represented as the multiple-character substitution "geing" → "ging". Rules can have wildcards, negation and disjunction, but this is not handled in the initial rule-generation phase.
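The generation of potential rules from a single error can be sketched as below. This is a minimal sketch under stated assumptions: the point of the error is located by stripping the longest common prefix and suffix of the error and its correction, the function name `candidate_rules` and the `max_context` limit are hypothetical, and wildcards, negation and disjunction are not produced, as in the initial rule-generation phase.

```python
def candidate_rules(error, correction, max_context=2):
    """Return (left-hand side, right-hand side) substitution pairs for one error."""
    limit = min(len(error), len(correction))
    # Length of the longest common prefix.
    p = 0
    while p < limit and error[p] == correction[p]:
        p += 1
    # Length of the longest common suffix that does not overlap the prefix.
    s = 0
    while s < limit - p and error[-1 - s] == correction[-1 - s]:
        s += 1

    rules = set()
    for left in range(min(max_context, p) + 1):
        for right in range(min(max_context, s) + 1):
            # The same context characters are added to both sides of the rule.
            lhs = error[p - left:len(error) - s + right]
            rhs = correction[p - left:len(correction) - s + right]
            if lhs:  # an empty left-hand side could never match anything
                rules.add((lhs, rhs))
    return rules
```

With these assumptions, the pair "recieve"/"receive" yields the candidate "cie" → "cei" among others, and "inspiratiion"/"inspiration" with three characters of context yields "atii" → "ati".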
Since different errors may generalize to the same sets of rules, duplicate rules are eliminated. Rules are also eliminated according to several heuristics. The number of times the left-hand side of the rule matches errors in the error corpus is examined. If more of the matches would fail to correct the error than successfully correct the error, the rule is discarded. This heuristic is equivalent to requiring the ratio of successful to unsuccessful firings in the error corpus to be greater than 1, or that the unsuccessful firings represent no more than 50% of the total matches in the error corpus. This latter figure is a tunable parameter. In some sense, it reflects the precision of the rule in correcting errors correctly.
The left-hand and right-hand sides of the rule are compared with a large dictionary. If the left-hand side appears more frequently than the right-hand side, the rule is discarded. This would mean that the rule has more exceptions than potential corrections and hence is not a very productive rule.
If the number of times the rule successfully matches and corrects an error in the error corpus is too low, the rule is discarded. The goal of this heuristic is to have rules that successfully account for as much of the error corpus as possible (i.e., maximize the rule's coverage of the corpus). Given that the corpus represents a sample of the distribution of errors in real life, rules that match more of the corpus will fire more frequently. This effectively minimizes the number of rules required to correct as many errors as possible. It also maximizes the likelihood that the rules reflect general types of errors, instead of just memorizing the specific errors found in the error corpus.
If the number of times the right-hand side of the rule matches words in the dictionary is too low, the rule is discarded. The goal of this heuristic is to have rules that can potentially correct a very large number of possible errors. After all, if a rule can correct only one potential error, it would be better to list that error explicitly than to use a rule.
If the number of times the rule matches the errors in the corpus but fails to successfully correct the error is too large, the rule is discarded. The goal of this heuristic is to obtain rules that pinpoint the nature of the error precisely. Failing to correct errors successfully is an indication of a poor quality rule. A rule that makes many mistakes will require not just exceptions that correspond to words in the dictionary, but also exceptions that correspond to errors. The number of such exceptions should be minimized to reduce the complexity of the rules.
If the number of times the left-hand side of the rule matches words in the dictionary is too high, the rule is discarded. The goal of this heuristic is to minimize the likelihood that the rule will introduce errors into words that are correct. Such words must be included in an exception list for the rule, and such exception lists must be kept short. If the exception list is too long, it is an indication of a poor quality rule. This effectively minimizes the number of exceptions to the rules.
If the left-hand side of the rule matches the right-hand side of the rule, it is discarded. The reason for this heuristic is that such rules match the results of applying a correction, and so will not terminate if applied iteratively. Such a rule would have to include the right-hand side on its exception list. (This heuristic is redundant because such rules will fail the second heuristic listed above.)
If two rules correct the same collection of errors, the rule with the lower ratio of exceptions to right-hand side dictionary matches is preferred. The purpose of this heuristic is to eliminate rules that are too general.
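Several of the pruning heuristics above can be sketched as a single filter. This is only an illustration under assumptions: rules are plain substring substitution pairs, the function names and all threshold values are hypothetical tunable parameters, and the preference between two rules that correct the same errors is not covered.

```python
def corpus_stats(rule, corpus):
    """Count corpus errors that the rule matches and corrects, or matches and miscorrects."""
    lhs, rhs = rule
    successes = failures = 0
    for error, correction in corpus:
        if lhs in error:
            if error.replace(lhs, rhs, 1) == correction:
                successes += 1
            else:
                failures += 1
    return successes, failures


def passes_pruning(rule, corpus, dictionary, max_failure_rate=0.5,
                   min_successes=2, min_rhs_matches=2, max_exceptions=1):
    lhs, rhs = rule
    # A left-hand side that matches the right-hand side would re-fire on its own output.
    if lhs in rhs:
        return False
    successes, failures = corpus_stats(rule, corpus)
    total = successes + failures
    # Unsuccessful firings may be at most max_failure_rate of all corpus matches.
    if total == 0 or failures / total > max_failure_rate:
        return False
    # Too few successful corrections means poor coverage of the corpus.
    if successes < min_successes:
        return False
    exceptions = [w for w in dictionary if lhs in w]   # valid words the rule would damage
    corrections = [w for w in dictionary if rhs in w]  # valid words the rule could produce
    if len(exceptions) > len(corrections):
        return False
    if len(corrections) < min_rhs_matches:
        return False
    if len(exceptions) > max_exceptions:
        return False
    return True
```

In this sketch a rule such as "ie" → "ei" is rejected as soon as the dictionary contains more than `max_exceptions` valid words with "ie", whereas a rule with more context, such as "ierd" → "eird", can survive.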
It is important to limit the number of rules in applications where memory is at a premium, such as hand-held computers like the "Palm Pilot". The "Palm Pilot" has only 1 MB of memory, so we had to limit the number of rules to fit in about 5%-6% of the memory. (A dictionary-based spelling correction system would require 1 MB just for the dictionary.)
After the rules are pruned, a fixed number "n" of the rules will be selected. The goal is to select the n-element subset of rules which maximizes the coverage of the rules (the number of error-correction pairs accounted for in the error corpus) while minimizing the number of exceptions. This is accomplished using stochastic search methods.
Another rule-design method comprises optimizing a different measure of rule collection quality, such as maximizing the dictionary coverage of the right-hand side of the rules while minimizing the number of exceptions or minimizing rule length, or maximizing the error corpus coverage of the left-hand side of the rule. Another rule-design method comprises using a greedy algorithm to incrementally add rules to the collection based on their incremental impact on collection quality. As errors are added to the error corpus, they are examined to determine what rules, if any, should be added to the rule collection. If a rule does not fail any of the pruning tests and increases the dictionary coverage of the collection without adding too many exceptions, it is added to the collection. In other words, if a new rule is of sufficient quality and does not overlap too much with the current rule collection, it is added to the collection. The shortest rules are most preferred.
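The greedy variant can be sketched as follows. The sketch assumes a simplified quality measure, namely the number of newly covered error-correction pairs, and breaks ties in favor of the shortest left-hand side; the function name `greedy_select` and the input format (a mapping from each rule to the set of corpus entries it corrects) are hypothetical. Exceptions are not modeled here.

```python
def greedy_select(rule_coverage, n):
    """Pick up to n rules, each time taking the rule with the largest incremental coverage.

    rule_coverage maps a (lhs, rhs) rule to the set of corpus entry ids it corrects.
    """
    chosen, covered = [], set()
    remaining = dict(rule_coverage)
    while remaining and len(chosen) < n:
        # Largest gain first; among equal gains, prefer the shorter left-hand side.
        best = max(remaining,
                   key=lambda r: (len(remaining[r] - covered), -len(r[0])))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining rule adds coverage
        chosen.append(best)
        covered |= gain
    return chosen, covered
```

A rule that overlaps entirely with the rules already chosen contributes no gain and is therefore never added.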
The second rule design method is similar in design to the first method but adds a few more pruning rules. The number of times the right-hand side of the rule matches corrections in the error corpus is examined. The rule is inverted and applied to the correction to generate an error which is then compared with the actual error. If the ratio of the number of times the generated error matches the actual error to the number of times it does not is less than one, the rule is discarded. This is equivalent to requiring the rule to account for at least 50% of the corrections it matches in the corpus. This latter figure is a tunable parameter. In some sense it reflects the degree to which the rule is a generative explanation for the source of the error (i.e., a measure of the degree to which the error distribution in the error corpus reflects the action of the rule).
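The inverse-rule test described above might look like the following sketch. It assumes rules are plain substring substitution pairs and that the inverse rule is applied to the first occurrence of the right-hand side only; the name `generative_ok` and the `min_fraction` parameter (the tunable 50% figure) are hypothetical.

```python
def generative_ok(rule, corpus, min_fraction=0.5):
    """Keep a rule only if its inverse regenerates enough of the observed errors."""
    lhs, rhs = rule
    hits = misses = 0
    for error, correction in corpus:
        if rhs in correction:
            # Apply the rule in reverse to the correction to generate an error.
            generated = correction.replace(rhs, lhs, 1)
            if generated == error:
                hits += 1
            else:
                misses += 1
    total = hits + misses
    return total > 0 and hits / total >= min_fraction
```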
Rules are evaluated by comparing them with the result of adding a character of context to either side of the patterns. Three sets are formed. The first set contains all dictionary words that match the right-hand side of the rule. The second set contains all errors that match the left-hand side of the rule. A subset of the first set is obtained by examining which letters appear one character to the left of the left-hand side pattern in the words in the second set, and finding all elements of the first set that match the extended patterns. These elements are joined by the words in the first set that match the characters that appear one character to the right of the left-hand side pattern in the words in the second set. Together these words form the third set. If the ratio of the number of elements in the third set to the number of elements in the first set is less than 75%, the rule is discarded as being too general. In essence, this heuristic measures the generative coverage of the rule relative to the dictionary, requiring the distribution of errors in the error corpus to be close to the distribution that would be predicted by applying the inverse of the rule, at least in an aggregate sense. If there is a large (more than 25%) group of dictionary words whose corresponding errors do not have representatives in the error corpus, this suggests that the rule generalizes too much and therefore does not correctly account for the cause of the errors.
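The three-set comparison can be sketched as below. The sketch assumes substring patterns and hypothetical function names; it returns the ratio of the third set to the first set, which would then be compared against the 75% threshold.

```python
def context_letters(words, pattern):
    """Collect the letters seen immediately left and right of pattern in words."""
    left, right = set(), set()
    for word in words:
        i = word.find(pattern)
        while i != -1:
            if i > 0:
                left.add(word[i - 1])
            j = i + len(pattern)
            if j < len(word):
                right.add(word[j])
            i = word.find(pattern, i + 1)
    return left, right


def generative_coverage(rule, dictionary, errors):
    lhs, rhs = rule
    first = {w for w in dictionary if rhs in w}   # dictionary words matching the right-hand side
    second = {e for e in errors if lhs in e}      # errors matching the left-hand side
    left, right = context_letters(second, lhs)
    # Words of the first set that also match a pattern extended by one observed character.
    third = {w for w in first
             if any(c + rhs in w for c in left) or any(rhs + c in w for c in right)}
    return len(third) / len(first) if first else 0.0
```

A rule would be discarded as too general when `generative_coverage(...)` falls below 0.75.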
This method can also provide a deterministic procedure for generating a rule from an example error. One starts with the smallest possible rule and adds characters to the left and/or right of the pattern (e.g., via a dynamic programming algorithm) until the resulting rules are no longer discarded as unacceptable. This gives a "fringe" of possible rules that can be evaluated by the rule preferences described above.
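One possible reading of this procedure is sketched below. It uses a breadth-first expansion of the context rather than a dynamic programming algorithm, and it takes the acceptance test as a caller-supplied function, since the full set of pruning tests is not reproduced here; the names `rule_fringe` and `acceptable` are hypothetical.

```python
def rule_fringe(error, correction, acceptable):
    """Grow context around the error until each rule is acceptable; return the fringe."""
    limit = min(len(error), len(correction))
    p = 0  # length of the common prefix
    while p < limit and error[p] == correction[p]:
        p += 1
    s = 0  # length of the common suffix that does not overlap the prefix
    while s < limit - p and error[-1 - s] == correction[-1 - s]:
        s += 1

    fringe, accepted, seen = [], [], set()
    frontier = [(0, 0)]  # (characters of left context, characters of right context)
    while frontier:
        next_frontier = []
        for left, right in frontier:
            if (left, right) in seen:
                continue
            seen.add((left, right))
            # Skip rules that merely extend an already accepted rule.
            if any(l <= left and r <= right for l, r in accepted):
                continue
            lhs = error[p - left:len(error) - s + right]
            rhs = correction[p - left:len(correction) - s + right]
            if lhs and acceptable(lhs, rhs):
                fringe.append((lhs, rhs))
                accepted.append((left, right))
            else:
                if left < p:
                    next_frontier.append((left + 1, right))
                if right < s:
                    next_frontier.append((left, right + 1))
        frontier = next_frontier
    return fringe
```

For example, `acceptable` could reject any rule whose left-hand side occurs in a valid dictionary word, so that context is added only until the rule no longer damages correct text.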