As discussed in U.S. Pat. No. 4,868,750 issued to Henry Kucera et al., a colloquial grammar checking system involves automated language analysis via a computer for receiving digitally encoded text composed in a natural language, using a stored dictionary of words and an analysis program to analyze the encoded text and to identify errors. In particular, such a system is utilized in the Microsoft Word program for detecting grammar errors.
One of the most troublesome problems associated with such systems is the extremely high error rate when the system suggests a proper usage. This unreasonably high error rate derives from the system's incorrect analysis of a sentence. Moreover, even assuming a correct analysis of a sentence, the Microsoft system often suggests an incorrect word.
There is also a class of systems which attempt to analyze a sentence based on the probability that the entire sentence is correct. The largest problem with such systems is that they require storage and processing power beyond the capability of present PCs and related memories.
Other systems attempt to detect incorrect grammar by analyzing sentences based on a training corpus. However, system constraints preclude this type of system from being utilizable in personal computing environments due to the massive storage involved as well as high speed processing required.
By way of example, prior grammar checking systems routinely fail to insert missing indefinite articles such as "a" and "an", which is a large problem for foreign speaking individuals when trying to translate into the natural language presented by the system.
Also of tremendous importance is the inability to insert the appropriate article, such as "the" or "a", when sentences are composed by those not familiar either with the grammar rules or with the colloquial usage of such articles. Moreover, prior art grammar checking systems commonly fail to recognize incorrect verb sequences in which multiple verbs are used. Although multiple verbs can be used properly in a sentence, most foreign speaking individuals routinely make mistakes such as "He has recognize that something exists." Here "has" is a verb and "recognize" is a verb. As can be seen, there is an obvious misuse of multiple verbs.
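The verb-sequence error described above can be illustrated with a minimal sketch. It is not the method of any system discussed here: the lexicon, the names PERFECT_AUXILIARIES and BASE_TO_PARTICIPLE, and the rule itself (a base-form verb directly after "has", "have" or "had" is suspicious) are assumptions made for illustration only.

```python
# Hypothetical mini-lexicon; a real checker would need a full dictionary
# with part-of-speech and inflection information.
PERFECT_AUXILIARIES = {"has", "have", "had"}
BASE_TO_PARTICIPLE = {
    "recognize": "recognized",
    "see": "seen",
    "write": "written",
}

def check_verb_sequence(sentence):
    """Flag a base-form verb that directly follows a perfect auxiliary.

    Returns a list of (word_index, found_word, suggested_word) tuples.
    """
    words = sentence.rstrip(".!?").split()
    issues = []
    for i in range(len(words) - 1):
        aux = words[i].lower()
        verb = words[i + 1].lower()
        if aux in PERFECT_AUXILIARIES and verb in BASE_TO_PARTICIPLE:
            issues.append((i + 1, words[i + 1], BASE_TO_PARTICIPLE[verb]))
    return issues

issues = check_verb_sequence("He has recognize that something exists.")
```

Such a rule only covers verbs whose base form differs from the past participle; verbs like "run", where the two forms coincide, would need a different treatment.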
Most importantly, problems occur with so-called determiners. For instance, the sentence "I have cigarette" is obviously missing the determiner "a". Likewise, determiners such as "some" or "a few" are often missing. Thus a proper sentence could have read "I have a few cigarettes". Note that the sentence could also properly be constructed simply by putting the noun in plural form, e.g. "I have cigarettes".
A further typical grammar error not corrected by either spell checkers or prior grammar systems is improper word inflection. For instance, as to improper verb inflections, such systems rarely correct a sentence such as "I drived to the market."
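As an illustration of the inflection error above, the following sketch detects a regular "-ed" ending wrongly applied to an irregular verb. The table IRREGULAR_PAST and the function name are hypothetical, and the approach (strip the ending, look up the stem) is only one simple way such a check might be written.

```python
# Hypothetical table of irregular verbs; a real system would use a
# complete inflection dictionary.
IRREGULAR_PAST = {
    "drive": "drove",
    "go": "went",
    "take": "took",
}

def correct_overregularized(word):
    """Return the irregular past tense if `word` is an '-ed' form of an
    irregular verb (e.g. 'drived'), otherwise return None."""
    lower = word.lower()
    if not lower.endswith("ed"):
        return None
    # Try both "stem + ed" (go + ed) and "stem-ending-in-e + d" (drive + d).
    for stem in (lower[:-2], lower[:-1]):
        if stem in IRREGULAR_PAST:
            return IRREGULAR_PAST[stem]
    return None

suggestion = correct_overregularized("drived")
```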
The above problems become paramount when viewed from the perspective of a non-native speaker unfamiliar both with the idiom and with the rules of the language. Especially with English, the rules are not as straightforward as one would like, with the correct "grammar" often determined by idiom or rules which are not familiar to those native speakers utilizing the language.
It is therefore important to provide a grammar checking system which takes into account the most frequent errors made by non-native speakers of a particular nationality. Thus for instance there is a body of errors normally made by Japanese native speakers which are translated into English in ways which are predictable and therefore correctable. Likewise for instance for French or any of the Romance languages, there are certain characteristic errors made when translating into English which can be detected and corrected.
Syntax recognizing systems have in general been limited to operating on text having a small, well-defined vocabulary, or to operating on more general text but dealing with a limited range of syntactic features. Extensions of either vocabulary or syntactic range require increasingly complex structures and an increasing number of special recognition rules, which make a system too large or unwieldy for commercial implementation on commonly available computing systems.
Another popular system for detecting and correcting contextual errors in a text processing system is described in U.S. Pat. No. 4,674,065 issued to Frederick B. Lang et al., in which proofreading a document for word use validation is accomplished by coupling a specialized dictionary of sets of homophones and confusable words to sets of di-gram and n-gram conditions from which proper usage of the words can be statistically determined. As mentioned before, doing statistics on words as opposed to parts of speech requires an exceptionally large training corpus and high speed computation, making the system somewhat unwieldy for personal computing applications. Moreover, this system, while detecting confusable words in terms of like-sounding words, is not sufficient to provide correction for those words which are confused in general usage but which do not sound alike.
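The general idea of coupling sets of confusable words to word di-gram statistics can be sketched as follows. This is not the implementation of the cited patent: the confusion sets, the tiny training text and the scoring rule are all assumptions chosen for illustration, and a realistic system would need a far larger corpus, which is exactly the storage and speed burden noted above.

```python
from collections import Counter

# Hypothetical confusion sets (illustrative only).
CONFUSION_SETS = [{"their", "there"}, {"affect", "effect"}]

# Tiny illustrative training text; real word statistics need a huge corpus.
TRAINING_TEXT = (
    "they lost their keys . there is a problem . "
    "the effect was small . it will affect the result . "
    "their house is there ."
)

def bigram_counts(text):
    """Count adjacent word pairs (di-grams) in whitespace-tokenized text."""
    words = text.split()
    return Counter(zip(words, words[1:]))

def suggest(words, index, counts):
    """Return the confusion-set member best supported by its neighbors,
    or the original word if no alternative has stronger support."""
    word = words[index]
    candidates = next((s for s in CONFUSION_SETS if word in s), None)
    if candidates is None:
        return word
    prev_word = words[index - 1] if index > 0 else None
    next_word = words[index + 1] if index + 1 < len(words) else None

    def support(candidate):
        # Sum of di-gram counts with the left and right neighbor.
        return counts[(prev_word, candidate)] + counts[(candidate, next_word)]

    best = max(sorted(candidates), key=support)
    return best if support(best) > support(word) else word

COUNTS = bigram_counts(TRAINING_TEXT)
fixed = suggest("they lost there keys".split(), 2, COUNTS)
```

Because the counts are keyed on individual words, every confusable word needs its own observed contexts, which illustrates why word-level statistics grow so much faster than statistics over parts of speech.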
Finally, U.S. Pat. No. 4,830,521 is a patent relating to an electronic typewriter with a spell checking function and proper noun recognition. It will be appreciated that the problem with noun recognition revolves around a capitalization scenario which may or may not be accurate in the recognition of a proper noun. Most importantly this patent tests words only to find if they are the first word in a sentence to determine the function of the capitalization, whereas capitalization can obviously occur for words anywhere in the sentence.
By way of further background numerous patents attack the grammar problem first through the use of spelling correction. Such patents include U.S. Pat. Nos. 5,218,536; 5,215,388; 5,203,705; 5,161,245; 5,148,367; 4,995,740; 4,980,855; 4,915,546; 4,912,671; 4,903,206; 4,887,920; 4,887,212; 4,873,634; 4,862,408; 4,852,003; 4,842,428; 4,829,472; 4,799,191; 4,799,188; 4,797,855; and 4,689,768.
There are also a number of patents dealing with text analysis such as U.S. Pat. Nos. 5,224,038; 5,220,503; 5,200,893; 5,164,899; 5,111,389; 5,029,085; 5,083,268; 5,068,789; 5,007,019; 4,994,966; 4,974,195; 4,958,285; 4,933,896; 4,914,590; 4,816,994; and 4,773,009. It will be appreciated that all of these patents relate to systems that cannot be practically implemented for the purpose of checking grammar to the levels required especially by those non-native speakers who are forced to provide written documents in a given natural language. It will also be appreciated that these patents relate to general systems which are not specifically directed to correcting grammar and English usage for non-native speakers.
Finally, there exist a number of patents which relate to how efficiently one can encode a dictionary, these patents being U.S. Pat. Nos. 5,189,610; 5,060,154; 4,959,785; and 4,782,464. It will be appreciated that encoding a dictionary is but one step in formulating a system which can adequately check grammar.
Of particular importance in grammar checking is the ability to detect the sequence of parts of speech as they exist in a given sentence. Correct sentences will have parts of speech which follow a normal sequence, such that by analyzing the part-of-speech sequence one can estimate the probability that the sentence is grammatically correct. While prior art systems have tagged a sentence for parts of speech and have analyzed the sequences of parts of speech for the above mentioned probability, these probabilities have never been utilized in a grammar checking and correcting system.
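The notion of scoring a sentence by the probability of its part-of-speech sequence can be made concrete with a small sketch. Everything here is an assumption for illustration: the tag names, the hand-tagged training sequences, the bigram model and the add-one smoothing are not taken from any of the systems discussed above.

```python
import math
from collections import Counter

# Hypothetical hand-tagged training sequences (tag names are illustrative).
TAGGED_CORPUS = [
    ["PRON", "VERB", "DET", "NOUN"],
    ["PRON", "AUX", "VERB-PP", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["PRON", "AUX", "VERB-PP", "NOUN"],
]

def train_bigrams(corpus):
    """Count tag bigrams, with sentence start/end markers added."""
    bigrams, contexts, tags = Counter(), Counter(), set()
    for sentence in corpus:
        padded = ["<s>"] + sentence + ["</s>"]
        tags.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(tags)

def sequence_log_prob(tags, bigrams, contexts, vocab_size):
    """Log-probability of a tag sequence under an add-one smoothed bigram model."""
    padded = ["<s>"] + list(tags) + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return log_prob

MODEL = train_bigrams(TAGGED_CORPUS)
# "He has recognized the problem" versus "He has recognize the problem".
good = sequence_log_prob(["PRON", "AUX", "VERB-PP", "DET", "NOUN"], *MODEL)
bad = sequence_log_prob(["PRON", "AUX", "VERB", "DET", "NOUN"], *MODEL)
```

With this toy data the sequence containing a base-form verb after the auxiliary scores lower than the sequence containing a past participle, so a low score can serve as a signal that the sentence deserves closer inspection. Because the model is built over a few dozen tags rather than over words, its tables stay small.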