1. Field of the Invention
This invention relates to natural language processing systems and more particularly to an apparatus and method for resolving structural ambiguities in sentences, using knowledge of the dependencies among words in sentences of natural language, and further to a method for constructing a knowledge base used in this connection. The term "words" herein signifies parts of speech such as nouns, verbs, adjectives, adverbs, and other semantic words, but excludes articles, prepositions, and other functional words. A semantic unit of successive words is also regarded as one word in some fields. For example, in documents related to computer technology the expression "virtual disk" is regarded as one word. The term "dependency" indicates a modifier-modifiee relationship among words.
2. Prior Art
Resolution of structural ambiguities in sentences still remains a difficult problem for natural language processing systems. An example of the problem is offered by prepositional phrase attachment ambiguities. The sentence, "A user can log on the system with a password.", is ambiguous as to whether the prepositional phrase "with a password" is attached adverbially to the verb "log on" or as a postmodifier to the noun phrase "the system."
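The two readings of the example sentence can be written out as sets of modifier-modifiee pairs. The following is a minimal sketch only: the pair representation, the name `candidate_readings`, and the choice to treat "user" and "system" as modifiers of "log on" are assumptions made for illustration and are not taken from any method described here.

```python
# Assumption: a dependency is a (modifier, modifiee) pair over content words.
# Dependencies that are the same in both readings of the example sentence.
SHARED = {("user", "log on"), ("system", "log on")}


def candidate_readings(modifier, candidates):
    """Return one dependency set per possible modifiee of `modifier`."""
    return [SHARED | {(modifier, head)} for head in candidates]


# "with a password" may attach to the verb "log on" or to the noun "system".
readings = candidate_readings("password", ["log on", "system"])
for reading in readings:
    print(sorted(reading))
```

Resolving the ambiguity then amounts to selecting one of the candidate sets.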
Some methods have been proposed for resolving structural ambiguities in sentences based on semantic and functional information on words, phrases, and other constituent elements. One such method is theoretically based on the case grammar disclosed in the article entitled, "Toward a modern theory of case" by Charles J. Fillmore on pp. 361-375 of "Modern Studies in English", published in 1969 by Prentice-Hall. The functions of the constituent elements of a sentence for a predicate are called cases, and semantic case functions are specifically called semantic cases (see appended Table 1).
In case grammar, each constituent element of a sentence is called a case element, and the adequacy of a sentence is evaluated by matching the cases and the case elements. Taking the above-indicated sentence as an example, the term "log on" is a predicate, while "a user" functions as an agent, "the system" as an object, and "a password" as an instrument. Each verb is assigned to a framework called a case frame in which the case of each verb and the constraint conditions of case elements with respect to the verb are described.
In case grammar, acceptable case elements for a case are defined, and any input outside the definition is rejected as being semantically inadequate. In practical language usage, however, the boundary between semantically acceptable and non-acceptable sentences is a delicate one, and this also depends on the context. For example, in the sentence, "My car drinks gasoline.", if the predicate "drink" merely accepts a word indicative of a human (a word having the semantic attribute HUM) as its agent, the term "car" is rejected as the agent of "drink." However, if "the car" is considered to be used metaphorically, the term "car" is semantically accepted as an agent of the term "drink". In a system such as case grammar that uses attribute values, it is easy to construct knowledge, but its application lacks flexibility.
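The rigidity of attribute-based case frames can be illustrated with a small sketch. Only the attribute HUM and the "drink" example come from the text above; the attributes MACHINE and LIQUID, the word list, and the names `ATTRIBUTES`, `CASE_FRAMES`, and `accepts` are hypothetical.

```python
# Hypothetical semantic attributes for a few words (illustrative only).
ATTRIBUTES = {
    "man": {"HUM"},
    "car": {"MACHINE"},
    "water": {"LIQUID"},
    "gasoline": {"LIQUID"},
}

# Hypothetical case frame: each case names the attribute its filler must carry.
CASE_FRAMES = {"drink": {"agent": "HUM", "object": "LIQUID"}}


def accepts(predicate, fillers):
    """Return True only if every case element satisfies the frame's constraint."""
    frame = CASE_FRAMES[predicate]
    return all(
        case in frame and frame[case] in ATTRIBUTES.get(word, set())
        for case, word in fillers.items()
    )


literal = accepts("drink", {"agent": "man", "object": "water"})
metaphor = accepts("drink", {"agent": "car", "object": "gasoline"})
print(literal, metaphor)
```

Under these assumptions the literal sentence is accepted and the metaphorical one is rejected outright, because the check is a hard yes-or-no test with no notion of degree.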
Japanese Published Unexamined Patent Application 63-91776 discloses a method using statistical information on the frequency of words to calculate the degree of preference of syntactic analysis trees for solving structural ambiguities. A summary of the method is given below, and some problems with it are explained.
(1) Multiple analysis trees are actually produced from an incoming sentence, and an acceptable one is selected from among them. It is troublesome to make multiple parse trees.
Furthermore, it is necessary to use information even on words that are not closely related to the ambiguities.
(2) The statistical frequency of co-occurrence relationships between words is used to solve ambiguities. Therefore, individual exceptions cannot be dealt with. For example, when an ambiguity exists as to whether a certain word A modifies word B or word C, the method does not consider that although it is statistically usual for A to modify B, in a certain particular sentence it modifies C. Further, since this method requires sufficiently formalized data (for example, registration of "virtual machine" as "machine is virtual"), collecting data is very costly.
(3) In general, the number of words in natural languages is enormous. In this connection, a category called the semantic marker, which is obtained by abstracting words, is established in order to extend the coverage range. However, it must be rearranged for a different field. For example, the term "department" is classified into the category of organization in a certain field P, and knowledge on the attachments of "department" is absorbed into statistical information on co-occurrence relationships between the organization category and another category. However, when the term "department" is classified into another category in a different field Q, the knowledge in field P is useless in field Q. It is very costly to re-abstract words and re-collect statistical information for each field.
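Problems (2) and (3) can be made concrete with a rough sketch of frequency-based selection. This is not the method of the cited application; all counts, category names (HUMAN, ORGANIZATION, LOCATION), the word "manager", and the function names are invented for illustration.

```python
from collections import Counter

# Hypothetical word-level counts of (modifier, modifiee) pairs; not real data.
WORD_COUNTS = Counter({("password", "log on"): 12, ("password", "system"): 3})

# Hypothetical semantic markers per field, and category counts gathered in field P.
MARKERS = {
    "P": {"department": "ORGANIZATION", "manager": "HUMAN"},
    "Q": {"department": "LOCATION", "manager": "HUMAN"},
}
CATEGORY_COUNTS_P = Counter({("HUMAN", "ORGANIZATION"): 40})


def pick_modifiee(modifier, candidates, counts=WORD_COUNTS):
    """Return the candidate that co-occurs most often with `modifier`."""
    return max(candidates, key=lambda head: counts[(modifier, head)])


def category_count(modifier, modifiee, field):
    """Look up field-P category statistics using the markers of `field`."""
    markers = MARKERS[field]
    return CATEGORY_COUNTS_P[(markers.get(modifier), markers.get(modifiee))]


choice = pick_modifiee("password", ["log on", "system"])
in_p = category_count("manager", "department", "P")
in_q = category_count("manager", "department", "Q")
print(choice, in_p, in_q)
```

In this sketch `pick_modifiee` returns the statistically usual attachment for every sentence, so an individual exception is never selected, and the statistics collected for field P contribute nothing once "department" is assigned a different marker in field Q.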
Problems to be Solved by the Invention
As will be seen from the above description, practical semantic processing in natural language processing involves two problems. One is efficient construction of the requisite large-scale knowledge base. The other is a mechanism for efficient use of that knowledge base.
An object of the invention is to provide a natural language semantic analysis system that overcomes these two problems and is acceptable for practical use,