This invention relates to digital data processing methods and means for separating acceptable spellings of words from nonacceptable spellings.
Methods have been proposed for using a programmed digital computer for correcting spelling errors. Efficiency is obtained by limiting the class of spelling errors to be considered. In this regard it has been found that over 80% of spelling errors fall into one of four classes of single error, namely, substitution of a character, deletion of a character, insertion of a character, or transposition of adjacent characters.
A specific computer program implemented method for correcting spelling errors is discussed in the Communications of the ACM, Volume 13, No. 2, February 1970, in an article entitled "Spelling Correction in Systems Programs" by Howard L. Morgan, pp. 90-94. The method involves comparing the word in question against a dictionary of correctly spelled words. The article discusses a first case where the entry word and the dictionary word are the same length. In this situation the method involves taking an exclusive OR of the word in question and the correctly spelled word and examining the nonzero positions in the result. If a character position is the same in both the word in question and the dictionary word, an exclusive OR will produce a zero result. If there is exactly one nonzero position, a substitution error has occurred. If there are exactly two nonzero positions, the two positions are checked for equality. If they are equal, the presence of a transposition is checked in the word in question and in the dictionary word, at these positions, and the correction is made.
A second case occurs when the lengths of the two words differ by one character. In this situation an exclusive OR is used to find the first nonequal position, starting at the left. Subsequently the remaining parts of the word in question and the dictionary word are aligned from the right and checked again by using the exclusive OR. If the two words match from the right down to the same unequal position, a single missing or added letter misspelling has been found at the position in question.
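The two cases described above can be sketched roughly as follows. This is an illustrative reconstruction from the description in this section, not code from the cited article; the function name `classify_single_error` and its return labels are assumptions introduced here.

```python
def classify_single_error(query: str, entry: str) -> str:
    """Classify query against a dictionary entry (hypothetical helper).

    Returns 'match', 'substitution', 'transposition', 'insertion',
    'deletion', or 'none'.
    """
    if len(query) == len(entry):
        # First case: XOR the character codes; zero means the position agrees.
        diffs = [ord(a) ^ ord(b) for a, b in zip(query, entry)]
        nonzero = [i for i, d in enumerate(diffs) if d != 0]
        if not nonzero:
            return "match"
        if len(nonzero) == 1:
            return "substitution"
        if len(nonzero) == 2:
            i, j = nonzero
            # Equal XOR results at adjacent positions suggest a transposition;
            # confirm by checking that the two characters are really swapped.
            if (j == i + 1 and diffs[i] == diffs[j]
                    and query[i] == entry[j] and query[j] == entry[i]):
                return "transposition"
        return "none"

    if abs(len(query) - len(entry)) == 1:
        # Second case: find the first unequal position from the left.
        query_longer = len(query) > len(entry)
        longer, shorter = (query, entry) if query_longer else (entry, query)
        i = 0
        while i < len(shorter) and (ord(longer[i]) ^ ord(shorter[i])) == 0:
            i += 1
        # Align the remainders from the right, skipping the suspect character.
        if longer[i + 1:] == shorter[i:]:
            # An extra letter in the query is an insertion; a missing one, a deletion.
            return "insertion" if query_longer else "deletion"
    return "none"
```

For example, under these assumptions `classify_single_error("hte", "the")` reports a transposition and `classify_single_error("ct", "cat")` reports a deletion. Every dictionary entry still has to be passed through such a comparison, which is the first disadvantage noted for this method.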
The aforementioned method discussed in the February 1970 Communications of the ACM suffers from the following disadvantages:
It must search all words in the dictionary.
It does not cope with candidates whose length differs from the query length by more than one character, which can be valid misspellings of inflected forms of the query word.
It does not classify multiple errors.
Another technique is disclosed in Communications of the ACM, Volume 7, No. 3, March 1964, in an article entitled "A Technique for Computer Detection and Correction of Spelling Errors" by Fred J. Damerau, pp. 171-176. Again the word in question is compared against a dictionary of correctly spelled words. The search is accelerated by performing a comparison of words only when the character counts of the word in question and the word from the dictionary are the same. If the word is found, the processing is terminated and the word in question is assumed to be correctly spelled.
If the word in question is not found, the program again searches the dictionary, this time using spelling correction rules. If the difference in length between the two words is greater than 1, the word in question cannot be an acceptable misspelling of the dictionary word. Even if the two words are equal in length or differ in length by only one character, no match is possible and no further comparison is required if the character register is different in more than 2 bit positions.
If after the foregoing a match is still considered possible, the two words are compared position-for-position. If the number of characters in the two words is the same and the words differ in only one character position, it is assumed that the two words are the same. If the two words differ in two adjacent character positions, the two characters of the word in question are interchanged and compared to the same two characters of the dictionary word and, if a match results, the two words are assumed to be the same. For all other cases of equal character length, a no-match condition is assumed. The next dictionary entry is compared to the word in question.
If the word in question is a character longer than the dictionary word, the first difference character of the word in question is discarded and the remaining characters are shifted left one position. If a match in all positions occurs, it is assumed that the words are the same. If the dictionary word is a character longer, the first difference character of the dictionary word is discarded and the words are compared the same as above.
The method is repeated until a match of the word in question is found or until all entries in the dictionary have been tested.
The method disclosed in the Communications of the ACM of March 1964 suffers from the following disadvantages:
It must search the entire dictionary.
It cannot handle candidate words more than one character longer than the query word.
The Communications of the ACM, December 1980, Volume 23, No. 12, contains an article entitled "Computer Programs for Detecting and Correcting Spelling Errors". This also makes reference to the fact that 80% of all spelling errors are of one of four classes. The article goes on to discuss a method whereby misspelled words in a dictionary may be matched against a token (i.e., word) which is being searched in the dictionary. In this regard, transpositions are detected by transposing each pair of adjacent characters in the token, one at a time, and searching for the resultant token. One extra letter is handled by deleting each character in the token, one at a time, and searching for the resultant tokens. One missing letter and one wrong letter are handled by one of two strategies. One of the approaches suggested is to substitute each potential character in each character position and search for the token in the dictionary. The problem with this approach is that a very large number of repeated searches is required. Another suggested approach is to create a table of tokens which are, in effect, acceptable misspellings of words. When a questionable word is encountered, one must search the table of acceptable misspellings to locate a token which matches the one in question. Obviously where a very large number of acceptable misspellings is involved, the search can be quite tedious.
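The first strategy described above, generating every single-error variant of the token and looking each one up, might be sketched as follows. The names `candidates` and `corrections`, the lowercase 26-letter alphabet, and the use of a Python set as the dictionary are assumptions made for illustration.

```python
import string


def candidates(token: str, alphabet: str = string.ascii_lowercase) -> set[str]:
    """All strings reachable from token by undoing one assumed single error."""
    out = set()
    n = len(token)
    # Transposition: swap each pair of adjacent characters, one at a time.
    for i in range(n - 1):
        out.add(token[:i] + token[i + 1] + token[i] + token[i + 2:])
    # One extra letter: delete each character, one at a time.
    for i in range(n):
        out.add(token[:i] + token[i + 1:])
    # One wrong letter: substitute each potential character at each position.
    for i in range(n):
        for ch in alphabet:
            out.add(token[:i] + ch + token[i + 1:])
    # One missing letter: insert each potential character at each position.
    for i in range(n + 1):
        for ch in alphabet:
            out.add(token[:i] + ch + token[i:])
    out.discard(token)
    return out


def corrections(token: str, dictionary: set[str]) -> list[str]:
    """Dictionary words that are one single error away from token."""
    return sorted(candidates(token) & dictionary)
```

For a token of length n and a 26-letter alphabet, this generates up to 54n + 25 strings before duplicates are removed, each of which requires a dictionary search. That count is what makes the repeated-search strategy expensive.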
Another approach discussed in that article involves the frequency of two letter pairs and three letter triples to detect potential misspellings in order to form an index into a table of acceptable misspellings.
Other approaches that have been taken involve so-called interactive spelling checkers. In this regard, each word is checked against a dictionary of correctly spelled words and, if the word is not in the dictionary, the user is asked what to do. This approach obviously does not provide any type of automatic matching of misspelled words. Another technique employed is to take tokens and convert them into standard phonetic spelling and to find similar sounding words in a dictionary. This, for example, works well with double errors using, for example, "f" for "ph" or "k" for "qu."
The problem of programming a computer to determine whether or not a string of characters is an acceptable misspelling of a given word has been widely considered. See for example "String Similarity and Misspellings" by Cyril N. Alberga, Communications of the ACM, Volume 10, No. 5, May 1967, pp. 302-313; "Approximate String Matching" by Patrick A. V. Hall and Geoff R. Dowling, Computing Surveys, Volume 12, No. 4, December 1980, pp. 381-402; and "Computer Programs for Detecting and Correcting Spelling Errors" by James L. Peterson, Communications of the ACM, Volume 23, No. 12, December 1980, pp. 676-687.
These prior art approaches to handling misspelling are generally slow and inefficient for large data bases.