1. Field of the Invention
The present invention relates to a natural language processing apparatus and a method for analyzing a natural language sentence using a dictionary and grammar data.
2. Related Background Art
When a document written in a natural language is analyzed, the definitions and properties of the individual words in a sentence are determined by referring to a dictionary. However, natural language sentences contain various words for which a dictionary cannot provide specific entries. For example, because new models of a product are released one after another, the names of all those models cannot be individually registered in a dictionary.
Further, consider onomatopoeic or mimetic words. In English, the "z" in "zzzzz", which represents the breathing sound produced by a sleeper, can be repeated an arbitrary number of times, so that "zzzzzzzzzz" is as valid as "zzzzz". Because an unbounded number of written variations can be produced for such an expression, registering all of its notations in a dictionary is not possible.
Furthermore, in a natural language sentence numerical expressions such as the following may appear: "1093", "5,000,000", "7.5", "1/2", "10-20", "2, 3", "5-3=2", "1997. 06. 25", "Jun. 25, 1997", "10:31", "03-3123-4567", and "2:3".
Depending on their form, these expressions represent, in order of appearance, the following: "integers", a "decimal fraction", a "fraction", "round numbers", an "equation", "dates", a "time", a "number" (here, a telephone number), and a "ratio".
For example, "5,000,000" represents an integer, while "Jun. 26, 1997" represents a date. When a natural language is analyzed, numerical expressions such as these must be extracted from sentences, and their meanings must be correctly identified.
Assume, for instance, that the expression "6/25/97" appears in a sentence that is to be analyzed for speech synthesis. If this entry were treated merely as a sequence of numerals and symbols to be read one by one, the resulting pronunciation would correspond to the string of words "six slash two five slash nine seven". However, were this numerical expression identified as an entry representing a "date", it would correctly be read as "June twenty-fifth, ninety-seven".
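By way of illustration only, the two readings contrasted above can be sketched as follows. The sketch assumes a slash-separated month/day/year input such as "6/25/97", which is the form implied by the reading "six slash two five slash nine seven". The function names and word tables are hypothetical and are not part of the described apparatus; days 1 to 31 and two-digit years 1 to 99 are the only values handled.

```python
# Hypothetical word tables for this sketch only.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
SYMBOL_WORDS = {"/": "slash"}
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
CARDINALS = ["", "one", "two", "three", "four", "five", "six", "seven",
             "eight", "nine", "ten", "eleven", "twelve", "thirteen",
             "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
             "nineteen"]
ORDINALS = ["", "first", "second", "third", "fourth", "fifth", "sixth",
            "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth",
            "thirteenth", "fourteenth", "fifteenth", "sixteenth",
            "seventeenth", "eighteenth", "nineteenth"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
ORDINAL_TENS = {2: "twentieth", 3: "thirtieth"}


def read_sequentially(text):
    """Read every character one by one, with no interpretation."""
    return " ".join(
        DIGIT_WORDS[int(c)] if c.isdigit() else SYMBOL_WORDS.get(c, c)
        for c in text
    )


def cardinal(n):
    """Spell a number from 1 to 99 (enough for a two-digit year)."""
    if n < 20:
        return CARDINALS[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + CARDINALS[ones] if ones else "")


def ordinal(n):
    """Spell a day of the month from 1 to 31 as an ordinal."""
    if n < 20:
        return ORDINALS[n]
    tens, ones = divmod(n, 10)
    if ones == 0:
        return ORDINAL_TENS[tens]
    return TENS[tens] + "-" + ORDINALS[ones]


def read_as_date(text):
    """Read 'month/day/year' as a date (ordering is an assumption)."""
    month, day, year = (int(part) for part in text.split("/"))
    return f"{MONTHS[month - 1]} {ordinal(day)}, {cardinal(year)}"


print(read_sequentially("6/25/97"))  # six slash two five slash nine seven
print(read_as_date("6/25/97"))       # June twenty-fifth, ninety-seven
```

The same character string thus yields two different word strings, and only the interpretation as a date yields the intended pronunciation.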
Consider, as another example, the information extraction technique. According to this technique, elements describing who, when, what, where, and how are extracted from a sentence and are expressed in the form of a table. The purpose of this technique is to protect a user from being inundated by the flood of information produced by recent developments in computer networking. If, as part of the pre-processing performed for information extraction, numerals can be correctly identified during the analysis of a natural language sentence, a date, which is important for establishing the "when" of an occurrence, can be correctly extracted.
Many of the above words are constructed in accordance with a specific rule. For instance, assuming that the product models are NL550, NL560, . . . , it can be ascertained that the models in this series are named using the pattern "NL<number>".
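As a minimal sketch, such a naming pattern could be written as a regular expression. The names MODEL_NAME and parse_model_name are hypothetical, and treating the number part as one or more ASCII digits is an assumption; the text fixes only the prefix "NL" followed by a number.

```python
import re

# Assumed reading of the pattern: the literal prefix "NL" followed by
# one or more ASCII digits.
MODEL_NAME = re.compile(r"NL([0-9]+)")


def parse_model_name(token):
    """Return the model number if the whole token fits the pattern, else None."""
    match = MODEL_NAME.fullmatch(token)
    return int(match.group(1)) if match else None


print(parse_model_name("NL550"))  # 550
print(parse_model_name("NL560"))  # 560
print(parse_model_name("NX550"))  # None
```

A single rule of this kind covers model names that do not yet exist, which a dictionary of registered names cannot do.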
Further, numerical expressions are not formed merely by arranging numerals and symbols; there are rules that govern the interpretation of the contents of expressions. In a fraction, for example, normally two numbers, and ordinarily not more than two, are juxtaposed with an intervening "/" symbol, and a number before or after the "/" usually does not begin with a "0". However, in a date expression that employs the same "/" symbol, three numbers may be included, as in "11/05/97", and a number that is set off by a "/" may begin with a "0".
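The distinction drawn above between a fraction and a date can be sketched as a small classifier. This is a simplification under stated assumptions: the function name is hypothetical, "11/05/97" is used as an illustrative slash-separated date, the tendencies described as "usually" in the text are treated as hard rules, and anything the text does not cover is reported as "unknown".

```python
def classify_slash_expression(text):
    """Classify a "/"-separated numeric expression as a fraction or a date."""
    parts = text.split("/")
    # Every part must be a non-empty string of ASCII digits.
    if not all(p.isascii() and p.isdigit() for p in parts):
        return "unknown"
    # Fraction: exactly two numbers, neither beginning with "0".
    if len(parts) == 2 and not any(p.startswith("0") for p in parts):
        return "fraction"
    # Date: three numbers; a leading "0" is permitted.
    if len(parts) == 3:
        return "date"
    return "unknown"


print(classify_slash_expression("1/2"))       # fraction
print(classify_slash_expression("11/05/97"))  # date
print(classify_slash_expression("01/2"))      # unknown
```

The same symbol is thus interpreted differently depending on the number of numerals and on whether a leading "0" is present.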
Furthermore, the rules governing numerical expressions depend not only on the order of the numbers and symbols, but also on the relationships of the quantities represented by the numbers. For example, when expressing round numbers, such as "2, 3", the quantity that is represented by the numeral preceding the "," must be smaller by "1" than the quantity that is represented by the succeeding numeral.
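A rule of this kind cannot be checked from the arrangement of characters alone, as the following sketch suggests. The names ROUND_PAIR and is_round_number_pair are hypothetical, and allowing optional whitespace after the "," is an assumption.

```python
import re

# Two digit strings separated by a comma and optional whitespace.
ROUND_PAIR = re.compile(r"([0-9]+),\s*([0-9]+)")


def is_round_number_pair(text):
    """True if text has the form "a, b" and b is exactly a + 1."""
    match = ROUND_PAIR.fullmatch(text)
    if match is None:
        return False
    first, second = int(match.group(1)), int(match.group(2))
    # Quantity check: the preceding number must be smaller by 1.
    return second - first == 1


print(is_round_number_pair("2, 3"))   # True
print(is_round_number_pair("2, 5"))   # False
print(is_round_number_pair("5,000"))  # False
```

Here "5,000" has the same arrangement of numerals and symbol as "2, 3", and is rejected only because the quantities do not satisfy the relationship.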
In order to correctly analyze words for which, in accordance with specific rules, an infinite number of descriptive variations can be produced, the rules that are used should ideally themselves be adequately described. In actuality, however, complete descriptions are not available for all such rules, and a system is therefore required that provides for the flexible addition, deletion, or correction of rules.
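Purely to illustrate what the flexible addition, deletion, or correction of rules might involve, the following sketch keeps rules as data in a table rather than fixing them in program logic. The RuleTable class, its method names, and the use of regular expressions as the rule notation are all assumptions made for this example; they do not describe the mechanism of any particular apparatus.

```python
import re


class RuleTable:
    """A hypothetical table of named pattern rules that can be edited at run time."""

    def __init__(self):
        self._rules = {}  # label -> compiled pattern, in insertion order

    def add(self, label, pattern):
        """Add a rule; adding under an existing label corrects that rule."""
        self._rules[label] = re.compile(pattern)

    def delete(self, label):
        """Delete a rule; deleting a missing label is a no-op."""
        self._rules.pop(label, None)

    def classify(self, token):
        """Return the label of the first rule matching the whole token, else None."""
        for label, pattern in self._rules.items():
            if pattern.fullmatch(token):
                return label
        return None


rules = RuleTable()
rules.add("model", r"NL[0-9]+")    # addition
print(rules.classify("NL550"))     # model
rules.add("model", r"N[LX][0-9]+") # correction
print(rules.classify("NX550"))     # model
rules.delete("model")              # deletion
print(rules.classify("NL550"))     # None
```

Because the rules are held as data, an incomplete or incorrect description can be revised without altering the analysis procedure itself.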