The present invention relates to natural language processing. More particularly, the present invention relates to the field of parsing natural human language.
In processing natural languages, such as English, Hebrew and Japanese, a parser is typically used to analyze sentences. A parser determines, for a sentence, the roles the words play and the interdependencies between those words. A first stage in parsing is breaking the input sentence into words and looking those words up in a lexicon to determine which parts of speech (POS) each word can have. For example, the word “brachiate” can only be a verb, the word “sentence” can be a verb or a noun, and “still” can be a noun, a verb, or an adverb, among other parts of speech. There are also individual words that, when adjacent in a sentence, act as a unit with a different part of speech. For example, “kind of” is treated as an adverb in the sentence “I kind of like her.” But in the sentence “It is a kind of cabbage,” the word “kind” is a noun and the word “of” is a preposition. Similar sets are “sort of,” “at least,” and “on the other hand.” These sets are called Multi-Word-Entries, or MWEs. A parsing system can assign many different types of parts of speech.
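The lexicon lookup described above can be illustrated with a minimal sketch. The lexicon contents, the tag names, and the function name `lookup` are assumptions made for illustration only and are not taken from any particular parser.

```python
# Toy lexicon and MWE table. Every entry and tag name here is hypothetical.
LEXICON = {
    "i": {"PRON"},
    "it": {"PRON"},
    "is": {"VERB"},
    "a": {"DET"},
    "kind": {"NOUN", "ADJ"},
    "of": {"PREP"},
    "like": {"VERB", "PREP"},
    "her": {"PRON"},
    "cabbage": {"NOUN"},
}
MWES = {
    ("kind", "of"): {"ADV"},
    ("at", "least"): {"ADV"},
}


def lookup(sentence):
    """Return candidate (start, end, text, parts_of_speech) tuples.

    Spans are half-open word indices. Both the single-word candidates and
    any matching MWE candidates are returned; choosing between them is
    left to a later stage.
    """
    words = sentence.lower().rstrip(".").split()
    candidates = []
    # One candidate per word, with every part of speech the lexicon allows.
    for i, word in enumerate(words):
        candidates.append((i, i + 1, word, LEXICON.get(word, {"UNKNOWN"})))
    # One extra candidate wherever adjacent words match an MWE entry.
    for i in range(len(words)):
        for mwe, pos in MWES.items():
            if tuple(words[i:i + len(mwe)]) == mwe:
                candidates.append((i, i + len(mwe), " ".join(mwe), pos))
    return candidates


adverb_reading = lookup("I kind of like her.")
noun_reading = lookup("It is a kind of cabbage.")
```

Note that the lookup produces the “kind of” MWE candidate for both sentences. The lexicon alone cannot tell which reading is correct, which is why the parser needs some way to decide which candidate to try first.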
In addition, some parsers construct larger units from individual words before doing the syntactic parse. These larger units generally have internal structure that is not syntactic. For example, street addresses, times of day, and proper names all have internal structure that must be dealt with outside of syntax. In the sentence “David Parkinson visited 123 Elm Street at 11:30 AM,” the units “David Parkinson,” “123 Elm Street” and “11:30 AM” can each be treated as a single larger unit by the grammar component (which is responsible for syntax). These units are called factoids. However, some sentences could have conflicting factoids. For example, in the sentence “After 1 second St. Augustine appeared” there are two overlapping factoids: “1 second St.” (which could be a street address) and “St. Augustine” (which could be a saint's name). Other sentences could have items in them that might be incorrectly identified as a factoid if the entire context of the sentence is not considered. For example, in the sentence “After I saw Henry Nixon walked into the room,” “Henry Nixon” should not be treated as a factoid, because “Henry” is the object of “saw” and “Nixon” is the subject of “walked.”
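The conflict between overlapping factoids can be made concrete by representing each candidate factoid as a span of word positions. The following is a sketch only; the `Factoid` class, the `kind` labels, and the whitespace tokenization are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Factoid:
    """A candidate factoid covering words[start:end] (half-open span)."""
    start: int
    end: int
    kind: str  # hypothetical label, e.g. "street_address"


def overlaps(a, b):
    # Two half-open spans overlap when each starts before the other ends.
    return a.start < b.end and b.start < a.end


def conflicts(factoids):
    """Return every pair of candidate factoids that share at least one word."""
    return [(a, b) for a, b in combinations(factoids, 2) if overlaps(a, b)]


words = "After 1 second St. Augustine appeared".split()
address = Factoid(1, 4, "street_address")  # "1 second St."
saint = Factoid(3, 5, "saint_name")        # "St. Augustine"
clashing = conflicts([address, saint])
```

Here both candidates claim the word “St.”, so at most one of them can appear in any single parse.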
The speed of a parser depends on how many different combinations of words it has to put together before it achieves a parse that spans the input sentence. There are many dead ends it could explore before finding the right way. For example, if it builds larger structures using a part of speech that does not make sense for a word, then all that work is for naught. In a similar vein, if it considers an MWE first when it should not, or considers a factoid first when the individual words are the correct analysis, then the parsing of that sentence will be slow. In addition, incorrect parses can be produced if the parser considers wrong parts of speech first. If an incorrect parse is generated before the correct one, and the parser decides to stop looking for the correct one, then an incorrect parse will be produced. The accuracy of factoid identification is also an issue. Confidence in whether a span is a factoid is assessed at two places in the system: when the factoid is constructed, and when the parse completes.
A Hidden Markov Model (HMM) using trigrams is a standard technique to predict which part of speech for each word is the preferred one in a given sentence. However, techniques for determining the parts of speech for MWEs and factoids are needed to improve parser performance. Also needed to improve parser performance are techniques to determine whether larger units, such as MWEs and factoids, should be considered first, or whether their individual pieces should be considered first by the parser.
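For reference, trigram HMM tagging of individual words can be sketched as a Viterbi search in which each state is the pair of the two most recent tags. The function name, the toy probability tables, and the fixed `floor` used in place of real smoothing are all assumptions made for illustration; a practical tagger would estimate these values from a tagged corpus.

```python
import math


def viterbi_trigram(words, tags, trans, emit, floor=1e-6):
    """Return the most probable tag sequence under a trigram HMM.

    trans[(t2, t1, t)] approximates P(t | t2, t1) and emit[(t, w)]
    approximates P(w | t). Missing entries fall back to `floor`, a crude
    stand-in for smoothing.
    """
    start = "<s>"
    # best[(previous_tag, current_tag)] = (log probability, tags so far)
    best = {(start, start): (0.0, [])}
    for word in words:
        new_best = {}
        for (t2, t1), (score, path) in best.items():
            for t in tags:
                s = (score
                     + math.log(trans.get((t2, t1, t), floor))
                     + math.log(emit.get((t, word), floor)))
                key = (t1, t)
                # Keep only the best-scoring path into each tag pair.
                if key not in new_best or s > new_best[key][0]:
                    new_best[key] = (s, path + [t])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]


# Toy model: all probabilities below are invented for illustration.
TAGS = ["PRON", "ADV", "VERB", "NOUN"]
EMIT = {
    ("PRON", "i"): 0.5, ("PRON", "her"): 0.5,
    ("ADV", "still"): 0.3, ("NOUN", "still"): 0.1, ("VERB", "still"): 0.1,
    ("VERB", "like"): 0.4,
}
TRANS = {
    ("<s>", "<s>", "PRON"): 0.6,
    ("<s>", "PRON", "ADV"): 0.3,
    ("<s>", "PRON", "VERB"): 0.5,
    ("PRON", "ADV", "VERB"): 0.7,
    ("PRON", "VERB", "VERB"): 0.05,
    ("ADV", "VERB", "PRON"): 0.4,
    ("VERB", "VERB", "PRON"): 0.4,
}

tagged = viterbi_trigram(["i", "still", "like", "her"], TAGS, TRANS, EMIT)
```

With these toy tables, the ambiguous word “still” is tagged as an adverb, because the adverb reading gives the highest combined trigram and emission score. The sketch handles single words only; it does not address MWEs or factoids, which is the gap identified above.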