The present invention deals with parsing text. More specifically, the present invention deals with improvements in left-corner chart parsing.
Parsing refers to the process of analyzing a text string into its component parts and categorizing those parts. This can be part of processing either artificial languages (C++, Java, HTML, XML, etc.) or natural languages (English, French, Japanese, etc.). For example, parsing the English sentence, the man with the umbrella opened the large wooden door, would normally involve recognizing that:
        opened is the main verb of the sentence,
        the subject of opened is the noun phrase the man with the umbrella,
        the object of opened is the noun phrase the large wooden door,
with the man with the umbrella and the large wooden door being further analyzed into their component parts. The fact that parsing is nontrivial is illustrated by the fact that the sentence contains the substring the umbrella opened, which in isolation could be a full sentence, but in this case is not even a complete phrase of the larger sentence.
Parsing by computer is sometimes performed by a program that is specific to a particular language, but often a general-purpose parsing algorithm is used with a formal grammar for a specific language to parse strings in that language. That is, rather than having separate programs for parsing English and French, a single program is used to parse both languages, but it is supplied with a grammar of English to parse English text, and a grammar of French to parse French text.
Perhaps the most fundamental type of formal grammar is context-free grammar. A context-free grammar consists of terminal symbols, which are the tokens of the language; a set of nonterminal symbols, which are analyzed into sequences of terminals and other nonterminals; a set of productions, which specify the analyses; and a distinguished “top” nonterminal symbol, which specifies the strings that can stand alone as complete expressions of the language.
The productions of a context-free grammar can be expressed in the form A→X1 . . . Xn where A is a single nonterminal symbol, and X1 . . . Xn is a sequence of n terminals and/or nonterminals. The interpretation of a production A→X1 . . . Xn is that a string can be categorized by the nonterminal A if it consists of a sequence of contiguous substrings that can be categorized by X1 . . . Xn.
The goal of parsing is to find an analysis of a string of text as an instance of the top symbol of the grammar, according to the productions of the grammar. To illustrate, suppose we have the following grammar for a tiny fragment of English:
        S→NP VP
        NP→Name
        Name→john
        Name→mary
        VP→V NP
        V→likes
In this grammar, terminals are all lower case, nonterminals begin with an upper case letter, and S is the distinguished top symbol of the grammar. The productions can be read as saying that a sentence can consist of a noun phrase followed by a verb phrase, a noun phrase can consist of a name, john and mary can be names, a verb phrase can consist of a verb followed by a noun phrase, and likes can be a verb. It should be easy to see that the string john likes mary can be analyzed as a complete sentence of the language defined by this grammar according to the following structure:
        (S: (NP: (Name: john))
            (VP: (V: likes)
                 (NP: (Name: mary))))
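By way of illustration only, the toy grammar above and the analysis of john likes mary might be sketched in Python as follows. The names GRAMMAR, splits, and analyses are hypothetical and chosen for this sketch. The exhaustive search over split points is a deliberately naive stand-in, not one of the parsing algorithms discussed in this document, and it assumes a grammar with no empty productions and no cycles of single-symbol productions.

```python
from itertools import combinations, product

# Productions of the toy grammar as (left-hand side, right-hand side) pairs.
GRAMMAR = [
    ("S", ("NP", "VP")),
    ("NP", ("Name",)),
    ("Name", ("john",)),
    ("Name", ("mary",)),
    ("VP", ("V", "NP")),
    ("V", ("likes",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def splits(tokens, n):
    """Yield every way to cut tokens into n contiguous, non-empty pieces."""
    for cuts in combinations(range(1, len(tokens)), n - 1):
        bounds = (0, *cuts, len(tokens))
        yield [tokens[bounds[i]:bounds[i + 1]] for i in range(n)]


def analyses(symbol, tokens):
    """Yield each tree that categorizes the tuple tokens as symbol."""
    if symbol not in NONTERMINALS:
        # A terminal categorizes exactly the one-token string equal to itself.
        if tokens == (symbol,):
            yield symbol
        return
    for lhs, rhs in GRAMMAR:
        if lhs != symbol:
            continue
        # A -> X1 ... Xn applies if the string splits into n contiguous
        # substrings categorized by X1 ... Xn respectively.
        for pieces in splits(tokens, len(rhs)):
            options = [list(analyses(x, p)) for x, p in zip(rhs, pieces)]
            for children in product(*options):
                yield (lhs, *children)


trees = list(analyses("S", ("john", "likes", "mary")))
```

Under these assumptions, trees holds a single nested tuple mirroring the bracketed structure shown above, and a string such as likes john yields no analysis as S.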
For parsing natural language, often grammar formalisms are used that augment context-free grammar in some way, such as adding features to the nonterminal symbols of the grammar, and providing a mechanism to propagate and test the values of the features. For example, the nonterminals NP and VP might be given the feature number, which can be tested to make sure that singular subjects go with singular verbs and plural subjects go with plural verbs. Nevertheless, even natural-language parsers that use one of these more complex grammar formalisms are usually based on some extension of one of the well-known algorithms for parsing with context-free grammars.
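As a rough sketch of the number-agreement test just described, a production S→NP VP augmented with a number feature might be applied as shown below. The dictionary representation, the function name combine_subject_and_predicate, and the feature values "singular" and "plural" are assumptions made for illustration; they are not drawn from any particular grammar formalism.

```python
def combine_subject_and_predicate(np, vp):
    """Apply S -> NP VP only when the number features of NP and VP agree."""
    if np["number"] != vp["number"]:
        return None  # agreement test fails; no S is built
    # Propagate the shared feature value to the new phrase.
    return {"category": "S", "number": np["number"], "children": (np, vp)}


# Hypothetical analyzed phrases carrying a number feature.
john = {"category": "NP", "number": "singular", "text": "john"}
likes_mary = {"category": "VP", "number": "singular", "text": "likes mary"}
like_mary = {"category": "VP", "number": "plural", "text": "like mary"}

agreeing = combine_subject_and_predicate(john, likes_mary)
clashing = combine_subject_and_predicate(john, like_mary)
```

Here agreeing is a sentence analysis, while clashing is None because a singular subject is paired with a plural verb phrase.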
Grammars for artificial languages, such as programming languages (C++, Java, etc.) or text mark-up languages (HTML, XML, etc.) are usually designed so that they can be parsed deterministically. That is, they are designed so that the grammatical structure of an expression can be built up one token at a time without ever having to guess how things fit together. This means that parsing can be performed very fast and is rarely a significant performance issue in processing these languages.
Natural languages, on the other hand, cannot be parsed deterministically, because it is often necessary to look far ahead before it can be determined how an earlier phrase is to be analyzed. Consider for example the two sentences:
        Visiting relatives often stay too long.
        Visiting relatives often requires a long trip.
In the first sentence, visiting relatives refers to relatives who visit, while in the second sentence it refers to the act of paying a visit to relatives. In any reasonable grammar for English, these two instances of visiting relatives would receive different grammatical analyses. The earliest point in the sentences where this can be determined, however, is after the word often. It is hard to imagine a way to parse these sentences, such that the correct analysis could be assigned with certainty to visiting relatives before it is combined with the analysis of the rest of the sentence.
The existence of nondeterminacy in parsing natural languages means that sometimes hundreds, or even thousands, of hypotheses about the analyses of parts of a sentence must be considered before a complete parse of the entire sentence is found. Moreover, many sentences are grammatically ambiguous, having multiple parses that require additional information to choose between. In this case, it is desirable to be able to find all parses of a sentence, so that additional knowledge sources can be used later to make the final selection of the correct parse. The high degree of nondeterminacy and ambiguity in natural languages means that parsing natural language is computationally expensive, and as grammars are made more detailed in order to describe the structure of natural-language expressions more accurately, the complexity of parsing with those grammars increases. Thus in almost every application of natural-language processing, the computation time needed for parsing is a serious issue, and faster parsing algorithms are always desirable to improve performance.
“Chart parsing” or “tabular parsing” refers to a broad class of efficient parsing algorithms that build a collection of data structures representing segments of the input partially or completely analyzed as a phrase of some category in the grammar. These data structures are individually referred to as “edges” and the collection of edges derived in parsing a particular string is referred to as a “chart”. In these algorithms, efficient parsing is achieved by the use of dynamic programming, which simply means that if the same chart edge is derived in more than one way, only one copy is retained for further processing.
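The dynamic-programming idea can be sketched as follows. This is a simplified bottom-up illustration under stated assumptions, not the method of the present invention: it keeps only completely analyzed edges, it supports only productions with two symbols on the right-hand side, and the names Edge, build_chart, lexicon, and binary_rules are hypothetical. The ambiguous grammar S→S S, S→a is likewise invented so that one edge is derived in more than one way.

```python
from collections import namedtuple

# An edge records that tokens[start:end] was analyzed as `category`.
Edge = namedtuple("Edge", "category start end")


def build_chart(tokens, lexicon, binary_rules):
    """Return (chart, derivations) for a simple bottom-up chart recognizer."""
    chart = set()
    agenda = [Edge(cat, i, i + 1)
              for i, tok in enumerate(tokens)
              for cat in lexicon.get(tok, ())]
    derivations = 0
    while agenda:
        edge = agenda.pop()
        derivations += 1
        if edge in chart:
            continue  # same edge derived another way: retain only one copy
        chart.add(edge)
        # Combine the new edge with every adjacent edge already in the chart.
        for other in list(chart):
            for left, right in ((edge, other), (other, edge)):
                if left.end == right.start:
                    key = (left.category, right.category)
                    for parent in binary_rules.get(key, ()):
                        agenda.append(Edge(parent, left.start, right.end))
    return chart, derivations


# Hypothetical ambiguous grammar: S -> S S and S -> a.
chart, derivations = build_chart(
    ["a", "a", "a"], {"a": ["S"]}, {("S", "S"): ["S"]})
```

For this input the edge spanning the whole string is derived twice, once for each bracketing, but the chart keeps a single copy, so the number of derivations exceeds the number of edges retained.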
The present invention is directed to a set of improvements to a particular family of chart parsing algorithms referred to as “left-corner” chart parsing. Left-corner parsing algorithms are distinguished by the fact that an instance of a given production is hypothesized when an instance of the left-most symbol on the right-hand side of the production has been recognized. This symbol is sometimes called the “left corner” of the production; hence, the name of the approach. For example, if VP→V NP is a production in the grammar, and a terminal symbol of category V has been found in the input, then a left-corner parsing algorithm would consider the possibility that the V in the input should combine with an NP to its right to form a VP.
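The hypothesizing step can be illustrated with a minimal sketch, assuming productions are stored as (left-hand side, right-hand side) pairs and indexed by their left corner. The names index_by_left_corner and hypothesize, and the tuple layout of the resulting edges, are assumptions made for this sketch; it shows only the triggering step, not a complete left-corner chart parser.

```python
from collections import defaultdict

# Nonterminal productions of the toy grammar used earlier in this section.
PRODUCTIONS = [
    ("S", ("NP", "VP")),
    ("NP", ("Name",)),
    ("VP", ("V", "NP")),
]


def index_by_left_corner(productions):
    """Map each symbol to the productions whose right-hand side starts with it."""
    index = defaultdict(list)
    for lhs, rhs in productions:
        index[rhs[0]].append((lhs, rhs))
    return index


def hypothesize(index, category, start, end):
    """Given a recognized `category` over [start, end), propose one edge per
    production having that category as its left corner. Each edge is
    (lhs, symbols still needed to the right, start, end)."""
    return [(lhs, rhs[1:], start, end) for lhs, rhs in index.get(category, ())]


index = index_by_left_corner(PRODUCTIONS)
# A V found at positions 1..2 suggests a VP that still needs an NP to its right.
after_verb = hypothesize(index, "V", 1, 2)
# A Name found at positions 0..1 yields an NP with nothing left to find.
after_name = hypothesize(index, "Name", 0, 1)
```

In this sketch after_verb contains the single hypothesis that a VP starting at position 1 still needs an NP, matching the VP→V NP example above, while after_name contains an NP edge whose list of remaining symbols is empty.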