A number of techniques exist for creating a lexicon by compiling words from a collection of documents, such as documents available on the Internet. A significant problem with such techniques, however, is that the source documents often contain errors, and those errors are carried into the lexicon, which is intended to be error-free. Thus, it is desirable to remove such errors from the lexicon being created.
A number of techniques exist for automatically detecting spelling errors. Suppose that a spell checking algorithm is given a word, G, such as a possibly misspelled word, and attempts to find one or more other words from a list of candidate words (such as validly spelled words) that are within a given edit distance from G. The edit distance between two words is the smallest number of fundamental operations that transform the candidate word into the given word. Each fundamental operation consists, for example, of removing one letter (deletion), adding one letter (insertion), replacing one letter with another letter (replacement), or transposing two letters (transposition).
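As a minimal sketch of the edit distance defined above, the following Python function counts deletions, insertions, replacements, and transpositions with a dynamic-programming table. The function name `edit_distance` is hypothetical, and the sketch makes two assumptions that the text does not state: transpositions are limited to adjacent letters, and the "optimal string alignment" variant is used, which can overestimate the true minimum in the uncommon case where the same substring would have to be edited more than once.

```python
def edit_distance(source: str, target: str) -> int:
    """Optimal-string-alignment distance (assumed variant, see lead-in)."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i letters of the prefix
    for j in range(cols):
        dist[0][j] = j  # insert all j letters of the prefix

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # replacement (or match)
            )
            # Transposition of two adjacent letters.
            if (
                i > 1
                and j > 1
                and source[i - 1] == target[j - 2]
                and source[i - 2] == target[j - 1]
            ):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]
```

Under these assumptions, a spell checker could keep each candidate word whose distance from G does not exceed the given threshold, for example `[w for w in candidates if edit_distance(w, G) <= 1]`.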
Two words are said to have a distance (or “edit distance”) of zero between them if they are identical. Given the above definition of “fundamental operation,” two words are said to be a distance one apart if one can get from one word to the other word by: (1) transposing one pair of adjacent characters; (2) replacing a single character with any other character; (3) deleting any one character; or (4) inserting an arbitrary character at any position in the original word. Likewise, words are a distance two apart if two operations of the type described above are required to get from the first word to the second word. More generally, two words are a distance N apart if N operations are required to get from the first word to the second.
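The four distance-one operations listed above can be illustrated by enumerating every string that is one operation away from a given word and keeping those that appear in a candidate list. This is only a sketch: the names `distance_one_variants` and `candidates_within_one` are hypothetical, and the alphabet is assumed to be lowercase ASCII letters, which the text does not specify.

```python
import string


def distance_one_variants(word, alphabet=string.ascii_lowercase):
    """Return all strings exactly one fundamental operation away from word.

    The lowercase ASCII alphabet is an assumption made for illustration.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # (1) transpose one pair of adjacent characters
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    # (2) replace a single character with another character
    replaces = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    # (3) delete any one character
    deletes = {l + r[1:] for l, r in splits if r}
    # (4) insert a character at any position
    inserts = {l + c + r for l, r in splits for c in alphabet}
    variants = transposes | replaces | deletes | inserts
    variants.discard(word)  # an identical word is at distance zero, not one
    return variants


def candidates_within_one(given, lexicon):
    """Return the lexicon words that are a distance one from the given word."""
    return sorted(distance_one_variants(given) & set(lexicon))
```

For example, `candidates_within_one("teh", ["the", "ten", "tech", "eh", "dog"])` keeps "the" (transposition), "ten" (replacement), "tech" (insertion), and "eh" (deletion), and drops "dog". Applying `distance_one_variants` again to each variant would reach words a distance two apart, at the cost of a much larger set of strings to check.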
Word processors typically perform spelling correction using a lexicon that is not derived from a user's collection of documents. Thus, when a user starts using a word processor and encounters words that are not found in the provided lexicon, such as company acronyms and product names, the unfound words are initially flagged as misspellings (unless and until the user adds the words to his or her personal lexicon). If, however, the lexicon were instead created by sifting through the existing documents of the user, or of a work group associated with the user, this effort could be saved.
Nonetheless, the documents of the user or work group would typically contain a number of errors that should not be included in the lexicon. A need therefore exists for improved techniques for automatically detecting spelling errors in one or more documents.