This patent application refers to material comprising a portion of a computer program listing presented as an appendix on CD. The file on the accompanying CD entitled Appendix for Spelling and Grammar Checking System.doc, created May 8, 2001, size 80,384 bytes, on the CD is incorporated herein by reference. The file includes three appendices, entitled Appendix A, Appendix B, and Appendix C.
1. Field of the Invention
The present invention relates generally to a spelling and grammar checking system, and more particularly to a spelling and grammar checking system which corrects misspelled words, incorrectly-used words, and contextual and grammatical errors. The invention has particular utility in connection with machine translation systems, word processing systems, and text indexing and retrieval systems such as World Wide Web search engines.
2. Description of the Related Art
Conventional spelling correction systems, such as those found in most common word processing applications, check whether each word in a document is found in a dictionary database. When a word is not found in the dictionary, the word is flagged as being incorrectly spelled. Suggestions for replacing the incorrectly-spelled word with its correctly-spelled counterpart are then determined by inserting, deleting and/or transposing characters in the misspelled word. For example, in a sentence like My son thre a ball at me, the word thre is not correctly-spelled. Conventional spelling correction systems, such as those described in U.S. Pat. No. 4,580,241 (Kucera) and U.S. Pat. No. 4,730,269 (Kucera), suggest words such as threw, three, there and the, as possible alternatives for the misspelled word by adding and deleting characters at different locations in the misspelled word. These alternative words are then displayed to a user, who must then select one of the alternatives.
One of the drawbacks of conventional systems is that they lack the ability to suggest alternative words based on the context in which the misspelled word appears. For example, in the following three sentences, the word thre appears in different contexts and, therefore, should be corrected differently in each sentence.
My son thre a ball through the window.
He broke thre window.
He moved thre years ago.
More specifically, in the first sentence, the incorrectly-spelled word thre should be replaced by threw. In the second sentence, the word thre should be replaced by the. In the third sentence, the word thre should be replaced by three. In spite of these differences in context, conventional spelling correction systems suggest the same list of alternative words, ranked in the same order, for all three of the foregoing sentences. For example, the spelling correction program provided in Microsoft® Word '97 suggests the following words, in the following order, for all three of the foregoing sentences: three, there, the, throe, threw.
Because conventional spelling correction systems do not rank alternative words according to context, such systems are not able to correct spelling mistakes automatically; doing so would often lead to an inordinate number of incorrectly corrected words. Rather, such systems typically use an interactive approach to correcting misspelled words. While such an approach can be effective, it is inefficient, and oftentimes very slow, particularly when large documents are involved. Accordingly, there exists a need for a spell checking system which is capable of ranking alternative words according to context, and which is also capable of automatically correcting misspelled words without significant user intervention.
Conventional spelling correction systems are also unable to correct grammatical errors in a document or other input text, particularly when the words involved are spelled correctly but are misused in context. By way of example, although the word too is misused in the sentence He would like too go home, conventional spelling correction systems would not change too to to, since too is correctly spelled. In this regard, grammar checking systems are available which correct improperly used words (see, e.g., U.S. Pat. No. 4,674,065 (Lange), U.S. Pat. No. 5,258,909 (Damerau), U.S. Pat. No. 5,537,317 (Schabes), U.S. Pat. No. 4,672,571 (Bass), and U.S. Pat. No. 4,847,766 (McRae)). Such systems, however, are of limited use, since they are only capable of correcting relatively short lists of predefined words. More importantly, such systems are not capable of performing grammar corrections on words that have been misspelled.
Accordingly, there exists a need for a spelling and grammar checking system which is capable of correcting words that have been misused in a given context, both in cases where the words have been spelled incorrectly and in cases where the words have been spelled correctly.
The present invention addresses the foregoing needs by providing a system which corrects both the spelling and grammar of words using finite state machines, such as finite state transducers and finite state automata. For each word in a text sequence, the present invention provides a list of alternative words ranked according to a context of the text sequence, and then uses this list to correct words in the text (either interactively or automatically). The invention has a variety of uses, and is of particular use in the fields of word processing, machine translation, text indexing and retrieval, and optical character recognition, to name a few.
In brief, the present invention determines alternatives for misspelled words, and ranks these alternatives based on a context in which the misspelled word occurs. For example, for the sentence My son thre a ball through the window, the present invention suggests the word threw as the best correction for the word thre, whereas for the sentence He broke thre window, the present invention suggests the word the as the best correction for the word thre. In its interactive mode, the invention displays alternative word suggestions to a user and then corrects misspelled words in response to a user's selection of an alternative word. In contrast, in its automatic mode, the present invention determines, on its own, which of the alternatives should be used, and then implements any necessary corrections automatically (i.e., without user input).
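The idea of ranking the same alternatives differently depending on context can be illustrated with a small sketch. It deliberately substitutes a toy bigram-count scorer for the finite state machinery of the invention; the function name `rank_by_context` and every count in `TOY_BIGRAMS` are invented for illustration and are not drawn from any real corpus.

```python
def rank_by_context(prev_word, next_word, candidates, bigram_counts):
    """Order candidates by how often they follow `prev_word` and precede
    `next_word` in a table of bigram counts (highest score first)."""
    def score(candidate):
        return (bigram_counts.get((prev_word, candidate), 0)
                + bigram_counts.get((candidate, next_word), 0))
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(candidates, key=lambda c: (-score(c), c))


# Hypothetical counts, made up for this example only.
TOY_BIGRAMS = {
    ("son", "threw"): 5, ("threw", "a"): 8,
    ("broke", "the"): 9, ("the", "window"): 12,
    ("moved", "three"): 4, ("three", "years"): 10,
}

ALTERNATIVES = ["three", "there", "the", "throe", "threw"]

print(rank_by_context("son", "a", ALTERNATIVES, TOY_BIGRAMS)[0])        # My son thre a ball ...
print(rank_by_context("broke", "window", ALTERNATIVES, TOY_BIGRAMS)[0])  # He broke thre window.
print(rank_by_context("moved", "years", ALTERNATIVES, TOY_BIGRAMS)[0])   # He moved thre years ago.
```

In an automatic mode, the top-ranked alternative would simply be substituted; in an interactive mode, the whole ranked list would be shown to the user.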
Advantageously, the invention also addresses incorrect word usage in the same manner that it addresses misspelled words. Thus, the invention can be used to correct improper use of commonly-confused words such as who and whom, homophones such as then and than, and other such words that are spelled correctly, but that are improper in context. For example, the invention will correct the sentence He thre the ball to the sentence He threw the ball (and not three, the, . . . ); the sentence fragment flight smulator to flight simulator (and not stimulator); the sentence fragment air baze to air base (and not baize, bass, babe, or bade); the phrase Thre Miles Island to Three Miles Island (and not The or Threw); and the phrase ar traffic controller to air traffic controller (and not are, arc, . . . ). The invention also can be used to restore accents (such as á, à, é, . . . ) or diacritic marks (such as ñ, ç, . . . ) in languages such as French and Spanish. For example, the current invention corrects the sentence il l'a releve to il l'a relevé (and not relève, relèvent, . . . ).
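As a rough illustration of the accent-restoration case, the sketch below indexes a lexicon by the unaccented form of each word, so that an unaccented input retrieves its accented candidates. The helper names and the tiny French lexicon are assumptions for illustration; choosing among the candidates would still require the context-based ranking described above.

```python
import unicodedata
from collections import defaultdict


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_accent_index(lexicon):
    """Map each unaccented spelling to the lexicon words that share it."""
    index = defaultdict(list)
    for word in lexicon:
        index[strip_accents(word)].append(word)
    return index


# Hypothetical toy lexicon.
TOY_FRENCH_LEXICON = ["relevé", "relève", "relèvent", "leçon"]
ACCENT_INDEX = build_accent_index(TOY_FRENCH_LEXICON)

print(ACCENT_INDEX["releve"])  # accented candidates for the unaccented input
```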
According to one aspect, the present invention is a system (i.e., an apparatus, a method and/or computer-executable process steps) for correcting misspelled words in input text. The system detects a misspelled word in the input text, and determines a list of alternative words for the misspelled word. The list of alternative words is then ranked based on a context of the input text.
According to another aspect, the present invention is a word processing system for creating and editing text documents. The word processing system inputs text into a text document, spell-checks the text so as to replace misspelled words in the text with correctly-spelled words, and outputs the document. The spell-checking performed by the system comprises detecting misspelled words in the text, and, for each misspelled word, determining a list of alternative words for the misspelled word, ranking the list of alternative words based on a context in the text, selecting one of the alternative words from the list, and replacing the misspelled word in the text with the selected one of the alternative words.
According to another aspect, the present invention is a machine translation system for translating text from a first language into a second language. The machine translation system inputs text in the first language, spell-checks the text in the first language so as to replace misspelled words in the text with correctly-spelled words, translates the text from the first language into the second language, and outputs translated text. The spell-checking performed by the system comprises detecting misspelled words in the text, and, for each misspelled word, determining a list of alternative words for the misspelled word, ranking the list of alternative words based on a context in the text, selecting one of the alternative words from the list, and replacing the misspelled word in the document with the selected one of the alternative words.
According to another aspect, the present invention is a machine translation system for translating text from a first language into a second language. The machine translation system inputs text in the first language, translates the text from the first language into the second language, spell-checks the text in the second language so as to replace misspelled words in the text with correctly-spelled words, and outputs the text. The spell-checking performed by the system comprises detecting misspelled words in the text, and, for each misspelled word, determining a list of alternative words for the misspelled word, ranking the list of alternative words based on a context in the text, selecting one of the alternative words from the list, and replacing the misspelled word in the document with the selected one of the alternative words.
According to another aspect, the present invention is an optical character recognition system for recognizing input character images. The optical character recognition system inputs a document image, parses character images from the document image, performs recognition processing on parsed character images so as to produce document text, spell-checks the document text so as to replace misspelled words in the document text with correctly-spelled words, and outputs the document text. The spell-checking performed by the system comprises detecting misspelled words in the document text, and, for each misspelled word, determining a list of alternative words for the misspelled word, ranking the list of alternative words based on a context in the text, selecting one of the alternative words from the list, and replacing the misspelled word in the document text with the selected one of the alternative words.
According to another aspect, the present invention is a system for retrieving text from a source. The system inputs a search word, corrects a spelling of the search word to produce a corrected search word, and retrieves text from the source that includes the corrected search word.
According to another aspect, the present invention is a system for retrieving text from a source. The system inputs a search phrase comprised of a plurality of words, at least one of the plurality of words being an incorrect word, and replaces the incorrect word in the search phrase with a corrected word in order to produce a corrected search phrase. Text is then retrieved from the source based on the corrected search phrase.
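The two retrieval aspects above can be sketched as follows. This is a simplified illustration only: it uses Python's standard `difflib.get_close_matches` as a stand-in for the invention's correction step, and the function names, the toy document collection, and the 0.6 similarity cutoff are all assumptions.

```python
import difflib


def correct_query(query, vocabulary):
    """Replace each query word that is not in the vocabulary with its
    closest vocabulary word, if any is similar enough."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        close = difflib.get_close_matches(word, sorted(vocabulary), n=1, cutoff=0.6)
        corrected.append(close[0] if close else word)
    return " ".join(corrected)


def retrieve(query, documents):
    """Correct the query, then return it with the ids of all documents
    that contain every corrected query word."""
    vocabulary = {w for text in documents.values() for w in text.lower().split()}
    corrected = correct_query(query, vocabulary)
    wanted = set(corrected.split())
    hits = [doc_id for doc_id, text in documents.items()
            if wanted <= set(text.lower().split())]
    return corrected, sorted(hits)


# Hypothetical document collection.
TOY_DOCUMENTS = {
    1: "flight simulator software for training pilots",
    2: "air traffic controller handbook",
    3: "garden tools catalog",
}

print(retrieve("flight smulator", TOY_DOCUMENTS))
```

Drawing the vocabulary from the indexed source itself is one possible design choice: a corrected query word is then guaranteed to occur somewhere in the collection.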
According to another aspect, the present invention is a system for correcting misspelled words in input text sequences received from a plurality of different clients. The system stores, in a memory on a server, a lexicon comprised of a plurality of reference words, and receives the input text sequences from the plurality of different clients. The system then spell-checks the input text sequences using the reference words in the lexicon, and outputs spell-checked text sequences to the plurality of different clients.
According to another aspect, the present invention is a system for selecting a replacement word for an input word in a phrase. The system determines alternative words for the input word, the alternative words including at least one compound word which is comprised of two or more separate words, each alternative word having a rank associated therewith. The system then selects, as the replacement word, an alternative word having a highest rank.
According to another aspect, the present invention is a system for correcting grammatical errors in input text. The system generates a first finite state machine ("FSM") for the input text, the first finite state machine including alternative words for at least one word in the input text and a rank associated with each alternative word, and adjusts the ranks in the first FSM in accordance with one or more of a plurality of predetermined grammatical rules. The system then determines which of the alternative words is grammatically correct based on the ranks associated with the alternative words, and replaces the at least one word in the input text with a grammatically-correct alternative word determined in the determining step.
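The generate, adjust, and select steps of this aspect can be sketched with a linear word lattice, that is, a simple acyclic finite state machine in which each position carries alternative words and their ranks. Everything concrete here is an assumption for illustration: the initial rank values, the confusion set, the list of base verbs, and the single rule `infinitive_rule` are hypothetical and far simpler than a real set of grammatical rules.

```python
BASE_VERBS = {"go", "be", "see", "eat"}  # hypothetical toy list


def build_lattice(words, confusion_sets):
    """One {alternative: rank} dict per input word. The original word
    starts with rank 1.0 and each alternative with rank 0.5."""
    lattice = []
    for word in words:
        arcs = {word: 1.0}
        for alternative in confusion_sets.get(word, ()):
            arcs[alternative] = 0.5
        lattice.append(arcs)
    return lattice


def infinitive_rule(lattice):
    """Hypothetical rule: raise the rank of 'to' when the next position
    can be a base verb (as in 'like to go')."""
    for i in range(len(lattice) - 1):
        next_is_verb = any(verb in lattice[i + 1] for verb in BASE_VERBS)
        if "to" in lattice[i] and next_is_verb:
            lattice[i]["to"] += 1.0


def best_path(lattice):
    """Pick the highest-ranked alternative at every position."""
    return [max(arcs, key=arcs.get) for arcs in lattice]


words = "He would like too go home".split()
lattice = build_lattice(words, {"too": ["to", "two"]})
infinitive_rule(lattice)
print(" ".join(best_path(lattice)))
```

Before the rule is applied, the original word too holds the highest rank at its position; the rule's adjustment is what causes to to be selected.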
According to another aspect, the present invention is a word processing system for creating and editing text documents. The word processing system inputs text into a text document, checks the document for grammatically-incorrect words, replaces grammatically-incorrect words in the document with grammatically-correct words, and outputs the document. The checking performed by the system comprises (i) generating a finite state machine ("FSM") for text in the text document, the finite state machine including alternative words for at least one word in the text and a rank associated with each alternative word, (ii) adjusting the ranks in the FSM in accordance with one or more of a plurality of predetermined grammatical rules, and (iii) determining which of the alternative words is grammatically correct based on ranks for the alternative words.
According to another aspect, the present invention is a machine translation system for translating text from a first language into a second language. The machine translation system inputs text in the first language, checks the text in the first language for grammatically-incorrect words, and replaces grammatically-incorrect words in the text with grammatically-correct words. The machine translation system then translates the text with the grammatically-correct words from the first language into the second language, and outputs the text in the second language. The checking performed by the machine translation system comprises (i) generating a finite state machine ("FSM") for the text in the first language, the finite state machine including alternative words for at least one word in the text and a rank associated with each alternative word, (ii) adjusting the ranks in the FSM in accordance with one or more of a plurality of predetermined grammatical rules, and (iii) determining which of the alternative words is grammatically correct based on ranks for the alternative words.
According to another aspect, the present invention is a machine translation system for translating text from a first language into a second language. The machine translation system inputs text in the first language, translates the text from the first language into the second language, checks the text in the second language for grammatically-incorrect words, replaces grammatically-incorrect words in the text with grammatically-correct words, and outputs the text with the grammatically-correct words. The checking performed by the system comprises (i) generating a finite state machine ("FSM") for the text in the second language, the finite state machine including alternative words for at least one word in the text and a rank associated with each alternative word, (ii) adjusting the ranks in the FSM in accordance with one or more of a plurality of predetermined grammatical rules, and (iii) determining which of the alternative words is grammatically correct based on ranks for the alternative words.
According to another aspect, the present invention is an optical character recognition system for recognizing input character images. The optical character recognition system inputs a document image, parses character images from the document image, performs recognition processing on parsed character images so as to produce document text, checks the document text for grammatically-incorrect words, replaces grammatically-incorrect words in the document text with grammatically-correct words, and outputs the document text. The checking performed by the system comprises (i) generating a finite state machine ("FSM") for the document text, the finite state machine including alternative words for at least one word in the text and a rank associated with each alternative word, (ii) adjusting the ranks in the FSM in accordance with one or more of a plurality of predetermined grammatical rules, and (iii) determining which of the alternative words is grammatically correct based on ranks for the alternative words.
According to another aspect, the present invention is a system for retrieving text from a source. The system inputs a search phrase comprised of a plurality of words, at least one of the plurality of words being a grammatically-incorrect word, replaces the grammatically-incorrect word in the search phrase with a grammatically-correct word in order to produce a corrected search phrase, and retrieves text from the source based on the corrected search phrase.
According to another aspect, the present invention is a system of spell-checking input text. The system detects a misspelled word in the input text, stores one or more lexicon finite state machines ("FSM") in a memory, each of the lexicon FSMs including plural reference words, generates an input FSM for the misspelled word, selects one or more reference words from the lexicon FSMs based on the input FSM, the one or more reference words substantially corresponding to a spelling of the misspelled word, and outputs selected ones of the one or more reference words.
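One way to picture selecting reference words from a lexicon automaton is sketched below. It is an assumption-laden stand-in rather than the method of this aspect: the lexicon is stored as a trie (a simple deterministic acyclic automaton), and instead of building an explicit input FSM, the search carries a row of Levenshtein distances for the misspelled word while walking the trie. The names `build_trie` and `fuzzy_lookup`, the `"$"` end marker, and the distance bound are all hypothetical.

```python
def build_trie(words):
    """Store the lexicon as nested dicts; the key "$" marks the end of a
    word and holds the full word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root


def fuzzy_lookup(trie, word, max_dist):
    """Return (distance, reference_word) pairs for every lexicon word
    within `max_dist` single-character edits of `word`."""
    results = []

    def walk(node, row):
        # row[j] is the edit distance between the trie path so far and word[:j].
        if "$" in node and row[-1] <= max_dist:
            results.append((row[-1], node["$"]))
        for ch, child in node.items():
            if ch == "$":
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(word) + 1):
                cost = 0 if word[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # skip a character of word
                                   row[j] + 1,           # extra character on the path
                                   row[j - 1] + cost))   # match or substitute
            # Prune branches that can no longer come within the bound.
            if min(new_row) <= max_dist:
                walk(child, new_row)

    walk(trie, list(range(len(word) + 1)))
    return sorted(results)


# Hypothetical toy lexicon.
TOY_LEXICON = ["the", "there", "three", "threw", "throw", "ball"]
TOY_TRIE = build_trie(TOY_LEXICON)

print(fuzzy_lookup(TOY_TRIE, "thre", 1))
```

Because words sharing a prefix share a path in the trie, the distance rows for that prefix are computed once, which is the main practical reason for storing a lexicon as an automaton rather than a flat list.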
This brief summary has been provided so that the nature of the invention may be understood quickly. A more complete understanding of the invention can be obtained by reference to the following detailed description of the preferred embodiments thereof in connection with the attached drawings.