The present invention is directed to a natural language sentence parser.
Natural language processing is hindered by the inability of machines to recognize the function of words as they appear in their context. The context for a word is the sentence in which it is framed. The functions of a word are indicated by the word's syntax.
The task is complicated by the fact that words can be used in several parts of speech. For instance, the word "fine" could be a noun, a verb, an adjective, or an adverb. The single most important task in the machine parsing of natural language is to be able to identify which part of speech a word is being used as. One of the most complicating factors in resolving parts of speech of words in English is that many nouns can also be verbs. The articles, adjectives, and possessive pronouns are very important cues to resolve this problem, as illustrated in the case of "a fine vase." Since the word "fine" follows an article, a rule can be established and applied in which "fine" cannot be a verb or an adverb. Once that rule has been applied, the phrase "a fine vase" can be merged into a noun phrase regardless of whether the word "fine" is a noun or an adjective.
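The article rule described above can be sketched in a few lines of Python. This is only an illustration: the lexicon, the part-of-speech labels, and the function name `apply_article_rule` are assumptions made for the example and are not taken from the invention.

```python
# Hypothetical mini-lexicon: each word maps to the parts of speech it may take.
LEXICON = {
    "a": {"ARTICLE"},
    "fine": {"NOUN", "VERB", "ADJECTIVE", "ADVERB"},
    "vase": {"NOUN"},
}

def apply_article_rule(words):
    """Return one set of candidate parts of speech per word, after removing
    VERB and ADVERB from any word that directly follows an article."""
    candidates = [set(LEXICON[w]) for w in words]
    for i in range(1, len(candidates)):
        if candidates[i - 1] == {"ARTICLE"}:
            candidates[i] -= {"VERB", "ADVERB"}
    return candidates

result = apply_article_rule(["a", "fine", "vase"])
# "fine" is now limited to NOUN or ADJECTIVE; either reading allows
# "a fine vase" to be merged into a noun phrase.
```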
The ability to use a computer to determine the appropriate syntax for sentences permits computers to participate in the analysis of enormous amounts of information, such as news reports from around the world. Analysis of such large databases can be useful in plotting trends in terms of a general understanding of, for example, violence or political unrest in various parts of the world. Alternatively, analysis may be conducted to plot news trends and how they relate to various stock market performance indices. Numerous such analyses are possible, but in order to obtain a meaningful interpretation from any such analysis, the system must be able to parse the sentences in the raw data.
A news analyzer would begin with a filter formatter which identifies the beginning and end of a sentence. The filter formatter needs to distinguish between periods that are found in the middle of a sentence and those which are found at the end of a sentence. Each sentence may then be provided to a parser for determining the syntax of the sentence. With the syntax of the sentence automatically determined, it then becomes possible to identify the action or verb set forth in the sentence, the subject of the sentence and the object of the action. The parsed sentence is then provided to an events generator arranged in accordance with the particular news analysis desired. The events generator would look for particular words of interest to the particular analysis being performed. In conjunction with the parsing of the sentence, the import of the various words can be better determined and more properly characterized in the final analysis. Events of import can be counted and associated with categories such as areas of the world. Such counted information can then be displayed or analyzed in chart or report format. The reliability of the analysis can be significantly enhanced by providing a parser that reliably identifies the proper syntax of the sentence.
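A filter formatter of the kind described above might distinguish mid-sentence periods from sentence-ending periods roughly as follows. This is a minimal sketch under stated assumptions: the abbreviation list, the capital-letter heuristic, and the name `split_sentences` are all hypothetical, and the text does not say how the actual filter formatter makes this decision.

```python
# Assumed list of abbreviations whose trailing period does not end a sentence.
ABBREVIATIONS = {"mr", "mrs", "dr", "inc", "u.s"}

def split_sentences(text):
    """Split text into sentences using a simple heuristic: a word-final
    period ends a sentence unless the word is a known abbreviation, and
    only if the next word is capitalized or the text ends."""
    words = text.split()
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if not word.endswith("."):
            continue  # periods inside a word, as in "3.5", are ignored
        stem = word[:-1].lower()
        next_word = words[i + 1] if i + 1 < len(words) else None
        at_boundary = next_word is None or next_word[0].isupper()
        if stem not in ABBREVIATIONS and at_boundary:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks a final period
        sentences.append(" ".join(current))
    return sentences

sentences = split_sentences(
    "Dr. Smith paid 3.5 million dollars. The U.S. market fell."
)
```

Each element of `sentences` could then be handed to the parser one at a time.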
In accordance with the method of an embodiment of the invention, words in a sentence are tokenized, whereby a list of syntactic identifiers corresponding to each word is indicated. Syntactic identifiers encompass parts of speech as well as other indicators of word usage. The tokens, each comprising a list of syntactic identifiers, are taken consecutively and compared with a first list of rules in order to produce a narrower set of possible syntactic interpretations of the words of the sentence. Syntactic identifiers in a token may be deleted or replaced by identifiers covering a smaller class of words. This token merging step is repeated until no further changes can be determined for the sentence at that level of rules. Using the narrower set of possible interpretations, token merging proceeds by matching the current set of tokens against a second list of rules, making possible a further reduction in the number of syntactic interpretations. The first level token merging and second level token merging are reiterated until no further reductions in the syntax of the sentence can be made.
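The iteration over two rule levels can be illustrated with a short Python sketch. Everything named here (the lexicon, the identifiers such as `ART` and `NP`, the two example rules, and the functions) is an assumption chosen for illustration; actual rule lists would be far larger and are not reproduced in this text.

```python
# Hypothetical lexicon mapping words to their syntactic identifiers.
LEXICON = {
    "a": {"ART"},
    "fine": {"NOUN", "VERB", "ADJ", "ADV"},
    "vase": {"NOUN"},
}

def tokenize(sentence):
    """One token per word; a token is the set of its possible identifiers."""
    return [frozenset(LEXICON[w]) for w in sentence.lower().split()]

def after_article(tokens, i):
    """First-level rule: a token following an article loses VERB and ADV."""
    if i + 1 < len(tokens) and tokens[i] == {"ART"}:
        narrowed = tokens[i + 1] - {"VERB", "ADV"}
        if narrowed and narrowed != tokens[i + 1]:
            return tokens[:i + 1] + [narrowed] + tokens[i + 2:]
    return None

def merge_noun_phrase(tokens, i):
    """Second-level rule: ART + (NOUN or ADJ) + NOUN merges into one NP."""
    if (i + 2 < len(tokens) and tokens[i] == {"ART"}
            and tokens[i + 1] <= {"NOUN", "ADJ"}
            and tokens[i + 2] == {"NOUN"}):
        return tokens[:i] + [frozenset({"NP"})] + tokens[i + 3:]
    return None

LEVEL_1 = [after_article]
LEVEL_2 = [merge_noun_phrase]

def apply_once(tokens, rules):
    """Apply the first rule that matches anywhere; None if nothing matches."""
    for rule in rules:
        for i in range(len(tokens)):
            result = rule(tokens, i)
            if result is not None:
                return result
    return None

def apply_level(tokens, rules):
    """Repeat one rule level until it produces no further change."""
    changed = False
    while (result := apply_once(tokens, rules)) is not None:
        tokens, changed = result, True
    return tokens, changed

def parse(sentence):
    """Alternate between the two levels until neither changes the tokens."""
    tokens = tokenize(sentence)
    while True:
        tokens, changed_1 = apply_level(tokens, LEVEL_1)
        tokens, changed_2 = apply_level(tokens, LEVEL_2)
        if not (changed_1 or changed_2):
            return tokens
```

In this sketch every rule application either removes an identifier or removes tokens, so the loop is guaranteed to stop. For "a fine vase", the first level narrows "fine" to noun or adjective, which is what allows the second-level rule to merge the three tokens into a single noun phrase.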
Another embodiment may include the step of matching consecutive words in a sentence with multiple words in a dictionary. If the dictionary contains possible syntactic identifiers for the consecutive words used in conjunction, then a token for the matched multiple words is substituted for the tokens of each of the individual words. A still further embodiment follows up on the method with deductive token merging. When several rules in a given list are matches for a sentence, in accordance with an embodiment of the invention, a longer of the applicable rules is applied.
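Multi-word matching and the preference for a longer rule might look like the following sketch. The dictionaries, the example phrase "in front of", the rule representation, and both function names are hypothetical and serve only to make the two ideas concrete.

```python
# Hypothetical single-word and multi-word dictionaries.
WORDS = {
    "in": {"PREP", "ADV"},
    "front": {"NOUN", "ADJ"},
    "of": {"PREP"},
    "the": {"ART"},
    "house": {"NOUN", "VERB"},
}
MULTI_WORDS = {("in", "front", "of"): {"PREP"}}

def tokenize_with_multiwords(words):
    """Build (text, identifiers) tokens, substituting a single token when
    consecutive words match a multi-word dictionary entry."""
    tokens, i = [], 0
    while i < len(words):
        for phrase, identifiers in MULTI_WORDS.items():
            if tuple(words[i:i + len(phrase)]) == phrase:
                tokens.append((" ".join(phrase), set(identifiers)))
                i += len(phrase)
                break
        else:  # no multi-word entry starts here: use the single word
            tokens.append((words[i], set(WORDS[words[i]])))
            i += 1
    return tokens

def pick_rule(matching_rules):
    """Among rules that all match, prefer the one with the longest pattern."""
    return max(matching_rules, key=lambda rule: len(rule["pattern"]))

tokens = tokenize_with_multiwords("in front of the house".split())
chosen = pick_rule([
    {"name": "short", "pattern": ["ART", "NOUN"]},
    {"name": "long", "pattern": ["ART", "ADJ", "NOUN"]},
])
```

Replacing three ambiguous tokens with one unambiguous preposition token leaves fewer interpretations for the later rule matching to resolve.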
The rules may include substitution rules, which retain the number of tokens but substitute or delete syntactic identifiers therein, and concatenation rules, which eliminate tokens. If both a substitution rule and a concatenation rule may be applied to a series of tokens, then the substitution rule is preferred and applied. The deductive token merging may include referring to a polysemy count to determine a most frequently preferred part of speech for a particular word in a sentence.
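The rule preference and the polysemy lookup can be sketched as below. The rule dictionaries, the `kind` field, and the polysemy figures for "fine" are invented for illustration only; real counts would come from a lexical resource that this text does not name.

```python
# Invented polysemy counts: number of senses per part of speech for a word.
POLYSEMY_COUNT = {"fine": {"ADJ": 9, "ADV": 2, "NOUN": 1, "VERB": 1}}

def choose_rule(applicable):
    """Prefer a substitution rule over a concatenation rule when both apply.
    Assumes `applicable` is a non-empty list of rule dictionaries."""
    substitutions = [r for r in applicable if r["kind"] == "substitution"]
    return (substitutions or applicable)[0]

def deduce_part_of_speech(word, remaining):
    """Deductive step: among the identifiers still possible for `word`,
    pick the one with the highest polysemy count (ties broken by name)."""
    counts = POLYSEMY_COUNT.get(word, {})
    return max(sorted(remaining), key=lambda pos: counts.get(pos, 0))

rule = choose_rule([
    {"name": "merge-np", "kind": "concatenation"},
    {"name": "drop-verb", "kind": "substitution"},
])
best = deduce_part_of_speech("fine", {"NOUN", "ADJ"})
```

The deductive step is a fallback: it would be used only when the inductive rule matching leaves more than one identifier in a token.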
A further embodiment of the invention is directed to a computer program product in which computer readable program code is present on a computer usable medium. The code includes a tokenizing code, first inductive merging program code which applies a first set of rules to consecutive tokens from an input sentence, a second inductive merging program code which applies a second set of rules to the narrower set of syntactic interpretations obtained from the first inductive merging program code and reiteration program code for cycling through the first and second inductive merging program codes until no further reductions in the syntactic interpretations are possible. The program code may further include multi-word matching program code.
A further embodiment of the invention is directed to a sentence parser having a tokenization module, a replaceable set of first substitution and concatenation rules, a replaceable set of second substitution and concatenation rules and an iterative inductive processor for reducing the syntactic possibilities for a sentence in accordance with matching against the rules. The parser may further include a multi-word comparator.