The present invention relates to natural language processing. In particular, the present invention relates to grammar checker processing of natural language text.
A computer program that checks a user's grammar for correctness is called a grammar checker. Upon finding a mistake, a grammar checker usually flags the error to the user and suggests a correction. Users find grammar checkers to be very helpful. However, the usefulness depends on the quality of the corrections the grammar checker suggests. The higher the accuracy, the happier the user.
In various grammar checkers, there are some mistakes that are difficult to evaluate using just heuristics. One such mistake is agreement between subject and verb. For example, the subject and verb in “He very quickly, after turning out the lights, eat the pistachios” do not agree. In this example, the subject and the verb are separated by a long distance, and so larger structures need to be considered. In the example “I insist that he go”, if one just looks at ‘he go’ it would seem that there is disagreement. However, because it is an argument clause to ‘insist’ there is no disagreement. In this case again one needs to consider larger scale structures. In many cases the correct larger scale structures to use are parts of the complete parse tree for the sentence.
There are other types of hard mistakes, such as writing ‘their’ where ‘there’ is meant. Again, for this type of mistake the best way to know if a mistake has been made is often to look at the parse tree for the entire sentence. Another hard type of mistake is either not forming possessives correctly, or using a possessive where a plural was meant.
There are various heuristics grammar checkers can use to identify such mistakes. Most of these involve some form of template matching. For example, one might construct a template that says if ‘their’ is followed by ‘is’, then change it to ‘there’. Some grammar checkers have used probabilistic techniques that take as evidence the words within a certain distance of the search site. Some use Hidden Markov Model techniques to identify errors. Some techniques use a parser to help identify mistakes. In all of these methods there are two distinct processes going on. The first process is the search for constructions that might be in error. The second process is the evaluation of the choices to see if an error was actually made. Both of these processes are error prone.
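By way of illustration only, the ‘their’/‘is’ template mentioned above might be sketched as follows. The pattern, the function name suggest_corrections, and the tuple layout of the result are assumptions made for this example; they are not taken from any particular grammar checker.

```python
import re

# Hypothetical template: the word "their" immediately followed by "is".
# The lookahead keeps "is" out of the match so only "their" is flagged.
TEMPLATE = re.compile(r"\b(their)(?=\s+is\b)", re.IGNORECASE)


def suggest_corrections(text):
    """Return (offset, found_word, suggestion) for each template match."""
    suggestions = []
    for match in TEMPLATE.finditer(text):
        found = match.group(1)
        # Preserve the capitalization of the flagged word.
        suggestion = "There" if found[0].isupper() else "there"
        suggestions.append((match.start(1), found, suggestion))
    return suggestions
```

The sketch shows both processes at once: the regular expression is the search for a suspect construction, and the unconditional replacement is a very crude evaluation. A template of this kind looks only at adjacent words, so it cannot use any larger structure of the sentence.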
Although these techniques for grammar checking natural language text have proven useful, there is an ongoing need to further improve the quality of the corrections suggested by the grammar checker. In particular, there is an ongoing need to improve the evaluation processes of grammar checkers.
Overview of Natural Language Processing
An overview of natural language processing (NLP) and related concepts is provided to aid in the understanding of the concepts of the invention. An NLP system is typically a computer-implemented software system, which intelligently derives meaning and context from an input string of natural language text. “Natural languages” are the imprecise languages that are spoken by humans (e.g., English, French, Japanese). Without specialized assistance, computers cannot distinguish linguistic characteristics of natural language text. An NLP system assists the computer in distinguishing how words are used in different contexts and in applying rules to construct intelligible language.
NLP Parser
The core of an NLP system is its parser. Generally, a parser breaks an utterance (such as a phrase or sentence) down into its component parts with an explanation of the form, function, and syntactical relationship of each part. The NLP parser takes a phrase and builds for the computer a representation of the syntax of the phrase that the computer can understand. A parser may produce multiple different representations for a given phrase. The representation makes explicit the role each word plays and the relationships between the words. As used herein, an utterance is equivalent to a phrase. A phrase is a sequence of words intended to have meaning, and a sentence is understood to be one or more phrases. In addition, references herein to a human speaker include a writer and speech includes writing.
FIG. 1 shows an NLP parser 20 of a typical NLP system. The parser 20 has four key components: tokenizer 28; grammar rules interpreter 26; searcher 30; and parse ranker 34. The parser 20 receives a textual string 22. Typically, this is a sentence or a phrase. The parser also receives grammar rules 24. These rules attempt to codify and interpret the actual grammar rules of a particular natural language, such as English. Alternatively, these rules may be stored in memory within the parser.
The grammar rules interpreter 26 interprets the codified grammar rules. The tokenizer 28 identifies the words in the textual string 22, looks them up in a dictionary, makes records for the parts of speech (POS) of a word, and passes these to the searcher. The searcher 30 in cooperation with the grammar rules interpreter generates multiple grammatically correct parses of the textual string. The searcher sends its results to the parse ranker 34.
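The tokenizer step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the three-word DICTIONARY, the function name tokenize, and the (word, part of speech) record layout are hypothetical and are not the data structures of the parser 20.

```python
# Hypothetical toy dictionary: each word maps to its possible parts of speech.
DICTIONARY = {
    "flies": ["noun", "verb"],
    "like": ["verb", "prep"],
    "ants": ["noun"],
}


def tokenize(text):
    """Split text on whitespace and emit one (word, pos) record per
    part of speech the dictionary lists for that word."""
    records = []
    for word in text.lower().split():
        # Words missing from the dictionary get a single "unknown" record.
        for pos in DICTIONARY.get(word, ["unknown"]):
            records.append((word, pos))
    return records
```

For the string “flies like ants”, this sketch produces five records, because “flies” and “like” each have two parts of speech. A searcher would receive all of them and decide which combinations form a grammatical parse.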
The parse ranker 34 mathematically measures the “goodness” of each parse and ranks them. “Goodness” is a measure of the likelihood that such a parse represents the intended meaning of the human speaker (or writer). The ranked output of the parse ranker is the output of the parser. This output is one or more parses 38 ranked from most to least goodness.
Linguistic Concepts of NLP
Linguists group the words of a language into classes that show similar syntactic behavior and, often, a typical semantic type. These word classes are called “syntactic categories” or “grammatical categories”, but are more commonly known by the traditional name “parts of speech” (POS). For example, common POS categories for English include noun, verb, adjective, preposition, and adverb.
Generally, words are organized into phrases, which are groupings of words that are clumped as a unit. Syntax is the study of the regularities and constraints of word order and phrase structure. Among the major phrase types are noun phrases, verb phrases, prepositional phrases, and adjective phrases.
The headword is the key word in a phrase because it determines the syntactic character of the phrase. In a noun phrase, the headword is the noun. In a verb phrase, it is the main verb. For example, in the noun phrase “red book”, the headword is “book.” Similarly, for the verb phrase “going to the big store”, the headword is “going.” A modifying headword is the headword of a sub-phrase within a phrase where the sub-phrase modifies the main headword of the main phrase. Assume a phrase (P) has a headword (hwP) and a modifying sub-phrase (M) within P that modifies hwP. The modifying headword (hwM) is the headword of this modifying phrase (M).
Syntactic features are distinctive properties of a word relating to how the word is used syntactically. For example, the syntactic features of a noun include whether it is singular (e.g. cat) or plural (e.g. cats) and whether it is countable (e.g. five forks) or uncountable (e.g. air). The syntactic features of a verb include, for example, whether or not it takes an object.
Computational Linguistics
In computational linguistics, the regularities of a natural language's word order and grammar are often captured by a set of rules called “transitions” or “rewrite rules.” The rewrite rules are a computer representation of rules of grammar. These transitions are used to parse a phrase. A rewrite rule has the notation form: “symbolA→symbolB symbolC . . . ”. This indicates that the symbol on the left side of the rule (symbolA) may be rewritten as one or more symbols (symbolB, symbolC, etc.) on the right side of the rule.
For example, symbolA may be “s” to indicate the “start” of the sentence analysis. SymbolB may be “np” for noun phrase and symbolC may be “vp” for verb phrase. The “np” and “vp” symbols may be further broken down until the actual words in the sentence are represented by symbolB, symbolC, etc. For convenience, transitions can be named so that the entire rule need not be recited each time a particular transition is referenced.
The nature of the rewrite rules is that a certain syntactic category (e.g., noun, np, vp, pp) can be rewritten as one or more other syntactic categories or words. The possibilities for rewriting depend solely on the category, and not on any surrounding context, so such phrase structure grammars are commonly referred to as context-free grammars (CFG).
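The rewrite rules described above can be sketched as a small table of named transitions. The rule set below is an assumption made for illustration. Apart from “np_noun”, the transition names are hypothetical, as are the helper functions apply_rule and derive.

```python
# Named transitions: name -> (left-hand symbol, right-hand symbols).
RULES = {
    "s_npvp": ("s", ["np", "vp"]),
    "np_noun": ("np", ["noun"]),
    "vp_verbnp": ("vp", ["verb", "np"]),
    "noun_flies": ("noun", ["flies"]),
    "noun_ants": ("noun", ["ants"]),
    "verb_like": ("verb", ["like"]),
}


def apply_rule(symbols, name):
    """Rewrite the leftmost occurrence of the rule's left-hand symbol.

    Only the symbol itself is inspected, never its neighbors, which is
    what makes the grammar context-free.
    """
    lhs, rhs = RULES[name]
    i = symbols.index(lhs)  # raises ValueError if the symbol is absent
    return symbols[:i] + rhs + symbols[i + 1:]


def derive(names, start="s"):
    """Apply the named transitions in order, beginning from the start symbol."""
    symbols = [start]
    for name in names:
        symbols = apply_rule(symbols, name)
    return symbols
```

Applying the transitions s_npvp, np_noun, noun_flies, vp_verbnp, verb_like, np_noun, noun_ants in that order rewrites “s” step by step into the words “flies like ants”.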
FIG. 2 illustrates a CFG parse tree 50 of a phrase (or sentence). This tree-like representation of the sentence “flies like ants” is deconstructed using a CFG set of rewrite rules (i.e., transitions). The tree 50 has nodes (such as 52a–52c and 54a–54g).
The tree 50 includes a set of terminal nodes 52a–52c. These nodes are at the end of each branch of the tree and cannot be further expanded. For example, “like” 52b cannot be expanded any further because it is the word itself. The tree 50 also includes a set of non-terminal nodes 54a–54g. These nodes are internal and may be further expanded. Each non-terminal node has immediate children, which form a branch (i.e., “local tree”). Each branch corresponds to the application of a transition. For example, “np” 54b can be further expanded into a “noun” by application of the “np_noun” transition.
Each non-terminal node in the parse tree is created via the application of some rewrite rule. For example, in FIG. 2, the root node 54a was created by the “s→np vp” rule, and the “vp” node 54d was created by the “vp→verb np” rule. The tree 50 has a non-terminal node 54a designated as the starting node and it is labeled “s.” In general, the order of the children in each branch generates the word order of the sentence, and the tree has a single root node (in FIG. 2 it is node 54a), which is the start of the parse tree.
A non-terminal node has a type that is called its “segtype.” In FIG. 2, each non-terminal node 54a–g is labeled with its segtype. A node's segtype identifies the rule that was used to create the node (working up from the terminal nodes). For example, the segtype of node 54b in FIG. 2 is “np” because the rule “np→noun” was used to create the node.
In a given grammar, a segtype can take many different values including, for example: NOUN, NP (noun phrase), VERB, VP (verb phrase), ADJ (adjective), ADJP (adjective phrase), ADV (adverb), PREP (preposition), PP (prepositional phrase), INFCL (infinitive clause), PRPRT (present participial clause), PTPRT (past participial clause), RELCL (relative clause), and AVPVP (a verb phrase that has a verb phrase as its head).
In this document, a functional notation is used to refer to the information associated with a node. For example, if a variable “n” represents a node in the tree, then “hw(n)” is the headword of node “n.” The following functions are used throughout this document:
    hw(n) is the headword of node n
    segtype(n) is the segtype of node n
    trans(n) is the transition (rewrite rule) associated with node n
    trn(n) is the name of the transition
    modhw(n) is the modifying headword of node n
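The functional notation above can be sketched as accessor functions over a simple node record. The Node class and its field names are assumptions made for illustration, as are the transition name “np_adjnoun” and the textual rule format. The example node is the noun phrase “red book” discussed earlier.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    seg: str                            # segtype label, e.g. "np"
    headword: str                       # headword of the phrase at this node
    rule: Optional[str] = None          # rewrite rule that created the node
    rule_name: Optional[str] = None     # name of that transition
    mod_headword: Optional[str] = None  # headword of the modifying sub-phrase
    children: List["Node"] = field(default_factory=list)


def hw(n):
    return n.headword


def segtype(n):
    return n.seg


def trans(n):
    return n.rule


def trn(n):
    return n.rule_name


def modhw(n):
    return n.mod_headword


# Hypothetical example: the noun phrase "red book".
red = Node(seg="adj", headword="red")
book = Node(seg="noun", headword="book")
red_book = Node("np", "book", "np -> adj noun", "np_adjnoun", "red", [red, book])
```

With this record, hw(red_book) yields “book” and modhw(red_book) yields “red”, matching the headword and modifying headword definitions given above.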
A parse tree can be annotated with information computed during the parsing process. A common form of this is the lexicalized parse tree, in which each node is annotated with its headword. One can annotate a parse tree with additional linguistic information (e.g. syntactic features). FIG. 3 shows an example of such a lexicalized parse tree 60. FIG. 3 is a parse tree of one of many parses of the sentence, “swat flies like ants.” Terminal nodes 62a–d, which are the words of the sentence, are not annotated. Non-terminal nodes 64a–i are annotated. For example, node 64h has a segtype of “noun” and is annotated with “hw=ants”. This means that its headword is “ants.” The parse tree 60 in FIG. 3 is also annotated with the names of the transitions between nodes. For example, the transition name “vp_verbvp” is listed between node 64f and node 64h.
A probabilistic context free grammar (PCFG) is a context free grammar where every transition is assigned a probability from zero to one. PCFGs have commonly been used to define a parser's “goodness” function. “Goodness” is a calculated measurement of the likelihood that a parse represents the intended meaning of the human speaker. In a PCFG, trees containing transitions that are more probable are preferred over trees that contain less probable transitions.
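A PCFG goodness function of this kind can be sketched as the product of the probabilities of the transitions used in a parse, which is the usual PCFG formulation. The probability values, the transition names, and the two example parses below are hypothetical. The table is partial and is not estimated from any corpus.

```python
import math

# Hypothetical transition probabilities (illustrative values only).
TRANSITION_PROB = {
    "s_npvp": 0.8,
    "s_vp": 0.2,
    "np_noun": 0.5,
    "vp_verbnp": 0.6,
    "vp_verbpp": 0.3,
    "pp_prepnp": 1.0,
}


def goodness(transitions):
    """Product of the probabilities of the transitions used in one parse."""
    return math.prod(TRANSITION_PROB[name] for name in transitions)


def rank(parses):
    """Order parses from most to least goodness."""
    return sorted(parses, key=goodness, reverse=True)


# Two hypothetical parses, each described only by the transitions it uses.
parse_a = ["s_npvp", "np_noun", "vp_verbnp", "np_noun"]
parse_b = ["s_vp", "vp_verbpp", "pp_prepnp", "np_noun"]
```

Here parse_a scores 0.8 × 0.5 × 0.6 × 0.5 = 0.12 and parse_b scores 0.2 × 0.3 × 1.0 × 0.5 = 0.03, so parse_a is ranked first. Long sentences multiply many small numbers, so a practical implementation would likely sum log probabilities instead.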
Since the probability of a transition occurring cannot be mathematically derived, the standard approach is to estimate the probabilities based upon a training corpus. A training corpus is a body of sentences and phrases that are intended to represent “typical” human speech in a natural language. The speech may be intended to be “typical” for general applications, specific applications, and/or customized applications. This “training corpus” may also be called “training data.”
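One common way to estimate transition probabilities from a training corpus is relative frequency: the number of times a transition was observed, divided by the number of times its left-hand symbol was rewritten at all. The sketch below assumes the training parses have already been reduced to a flat list of (left-hand symbol, right-hand symbols) pairs. That input format and the function name are assumptions, and other estimators are possible.

```python
from collections import Counter


def estimate_probabilities(observed):
    """Relative-frequency estimate of each transition's probability.

    observed: iterable of (lhs, rhs) pairs, where rhs is a tuple of symbols,
    one pair per transition seen in the parsed training corpus.
    """
    observed = list(observed)
    rule_counts = Counter(observed)
    lhs_counts = Counter(lhs for lhs, _ in observed)
    return {
        rule: count / lhs_counts[rule[0]]
        for rule, count in rule_counts.items()
    }


# Hypothetical observations taken from a tiny training corpus.
observations = [
    ("np", ("noun",)),
    ("np", ("noun",)),
    ("np", ("adj", "noun")),
    ("vp", ("verb", "np")),
]
probabilities = estimate_probabilities(observations)
```

With these observations, “np→noun” receives probability 2/3, “np→adj noun” receives 1/3, and “vp→verb np” receives 1. The estimates for each left-hand symbol sum to one.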
An augmented phrase structure grammar (APSG) is a CFG that gives multiple names to each rule, thereby limiting the application of each “named” rule. Thus, each rewrite rule has more than one name, and each name limits the rule's use to specific and narrower situations. For example, the structure “VP→NP VP” may have these limiting labels: “SubjPQuant” and “VPwNPl.” SubjPQuant specifies subject post-quantifiers on a verb phrase. For example, in “all found useful . . . ”, “all” is a subject post-quantifier; within “we all found useful the guidelines”, the structure is [NP all][VP found useful the guidelines]. VPwNPl specifies a subject to a verb phrase. For example, “John hit the ball” is [NP John] [VP hit the ball], where John is the subject.
Given the ambiguity that exists in natural languages, many sentences have multiple syntactic interpretations. The different syntactic interpretations generally have different semantic interpretations. In other words, a sentence has more than one grammatically valid structure (“syntactic interpretation”) and, as a result, may have more than one reasonable meaning (“semantic interpretation”). A classic example of this is the sentence, “time flies like an arrow.” There are seven valid syntactic parse trees.
FIGS. 4a and 4b show examples of two of the seven valid parses of this sentence. For the parse tree 70 of FIG. 4a, the object “time” 74 moves in a way that is similar to an arrow. For the parse tree 80 of FIG. 4b, the insects called “time flies” 84 enjoy the arrow object; just as one would say “Fruit flies like a meal.” Either parse could be what the speaker intended. In addition, five other syntactically valid parses may represent the meaning that the speaker intended.
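Syntactic ambiguity of this kind can be demonstrated by counting the parse trees that a grammar permits. The sketch below uses a chart (CKY-style) counter over a deliberately small toy grammar. The lexicon, the rules, and the function name count_parses are assumptions made for illustration. Because the toy grammar is much smaller than a full English grammar, it finds only two parses for the example sentence, not all seven.

```python
from collections import defaultdict

# Hypothetical toy lexicon and grammar.
LEXICON = {
    "time": {"N", "V"},
    "flies": {"N", "V"},
    "like": {"V", "P"},
    "an": {"Det"},
    "arrow": {"N"},
}
UNARY = {"N": "NP", "V": "VP"}  # applied to single words only
BINARY = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "N", "N"),
    ("VP", "V", "NP"),
    ("VP", "V", "PP"),
    ("PP", "P", "NP"),
]


def count_parses(words, root="S"):
    """Count the distinct parse trees labeled root that span all the words."""
    n = len(words)
    # chart[(i, j)][cat] = number of trees of type cat covering words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for cat in LEXICON[word]:
            chart[(i, i + 1)][cat] += 1
            if cat in UNARY:
                chart[(i, i + 1)][UNARY[cat]] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in BINARY:
                    chart[(i, j)][parent] += (
                        chart[(i, k)][left] * chart[(k, j)][right]
                    )
    return chart[(0, n)][root]
```

Under this toy grammar, the two parses counted for “time flies like an arrow” correspond to the two readings discussed above: one in which “time” is the subject of “flies”, and one in which “time flies” is a noun phrase that is the subject of “like”.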
A conventional approach used in NLP systems to determine which of multiple grammatically correct parses is the “correct” one is the use of a “goodness” function to calculate a “goodness measure” of each valid parse. Existing parsers differ in the extent to which they rely on a goodness function, but most parsers utilize one. A number of different goodness measures have been used in natural language systems to rank parse trees. For example, goodness measures based upon probabilities determined by how frequently a given parse tree occurred in a training corpus (a “straw man” approach) have been used. Other goodness measures use a collection of mostly unrelated statistical calculations based upon parts of speech, syntactic features, word probabilities, and selected heuristic rules. Still other goodness measures are based upon syntactic bigram approaches, transition probability approaches (TPA), or other methods.