1. Technical Field
The invention disclosed herein broadly relates to data processing methods for natural language processing (NLP), and more particularly relates to an improved data processing method for determining the basic semantic structures of sentences.
2. Background Art
Natural language texts may be said to consist of groups of propositions made up of predicates and their arguments. An example of a predicate is a verb, and its arguments can be exemplified by associated nouns or noun phrases. For example, in the sentence:
John loves Mary,
there is one proposition, whose predicate is the verb "loves." "Loves" has two arguments in this proposition: "John" and "Mary."
In order for a computer system to understand natural language, it must be able to identify, correctly, the predicate and argument groups. For a simple sentence like the one above, this is not hard. If an English verb is closely surrounded by its arguments (as in "John loves Mary" above), then it is relatively easy for the computer grammar to assign the proper arguments to the verb. But for more complicated sentences, such as many that appear in real-life text, the task becomes much more difficult. The difficult problem arises when the arguments are not close to their verb.
In fact, arguments may sometimes be missing from the sentence entirely, and yet must be inferred by the program, just as a human would infer them. For example:
Mary was kissed.
In this sentence, the only visible argument for the verb "kissed" is "Mary." But we can infer another argument, corresponding to some person who did the kissing. Another, related, situation occurs in sentences like:
Who did Mary think that Peter said that John kissed?
In the foregoing sentence, again there are two arguments for the verb "kissed." "John" is close by, but "who," the second argument, is far away from its verb. The problem, then, is properly to link all arguments, including the missing and far-removed ones, with their predicates.
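By way of illustration, the desired result of this linking can be sketched as a small data structure. The following is a hypothetical representation for exposition only; the names `Proposition`, `predicate`, and `arguments`, and the role labels, are not taken from the method disclosed herein:

```python
# A minimal, hypothetical representation of filled
# predicate-argument structures, for exposition only.
from dataclasses import dataclass, field

@dataclass
class Proposition:
    predicate: str                                  # the verb
    arguments: dict = field(default_factory=dict)   # role -> filler

# "John loves Mary": both arguments are adjacent to the verb.
simple = Proposition("love", {"subject": "John", "object": "Mary"})

# "Mary was kissed": the doer is absent and must be inferred.
passive = Proposition("kiss", {"subject": "someone", "object": "Mary"})

# "Who did Mary think that Peter said that John kissed?":
# "who" is far removed from its verb, yet must be linked to it.
long_distance = Proposition("kiss", {"subject": "John", "object": "who"})
```

The task addressed below is precisely the construction of such filled structures from a syntactic parse, including the inferred filler in the passive example and the far-removed filler in the last example.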
The problem of identifying predicate-argument structures--and, in particular, of correctly assigning "long-distance dependencies," as in "Who did Mary think that Peter said that John kissed?"--is well known in the literature of linguistics and computational linguistics. Two chief methods have been described for accomplishing this:
the "empty category" (EC) approach; and

the "functional uncertainty" (FU) approach.

The EC approach is advocated, for example, by linguists of the Government and Binding (GB) and Generalized Phrase Structure Grammar (GPSG) schools. (Sells, P., Lectures on Contemporary Syntactic Theories, CSLI, Stanford University, Stanford, Calif., 1985.) This approach uses parse structures that contain empty slots in the places where the dislocated constituents would be if the sentence were in its most neutral form. For example, the sentence

Alice, Peter said that John kissed. (=Peter said that John kissed Alice.)

is supposed to have an "empty category," or "trace" (symbolized by "e"), right after the verb "kissed," because that is where the noun phrase "Alice" belongs. Computational grammars that are built along these lines actually specify empty slots in their parse structures, or trees (see FIG. 1A).

The FU approach is advocated by linguists who adhere to the theories of Lexical Functional Grammar (LFG). This approach bases its solution not on empty slots in a parse tree, but rather on the incremental evaluation, from left to right in a sentence, of the characteristics of all the verbs ("characteristics" here chiefly refers to the required number and kind of arguments that a verb must have), in order to find out where the displaced constituent best fits. A formal notational device has been added to the LFG grammar-writing language for the purpose of computing the properly filled argument structures. (Kaplan, R. M. and A. Zaenen, "Long-distance Dependencies, Constituent Structure, and Functional Uncertainty," in M. Baltin and A. Kroch, eds., Alternative Conceptions of Phrase Structure, Chicago University Press, 1987.) Computational grammars that are built along these lines use this device, in their grammar rules, to specify where the missing argument should be assigned.

The present method differs from both of these approaches. It differs from the EC approach in that:

a. It does not use empty categories or traces of any kind;

b. It does not rely so heavily on the constituent, or tree, structure, but rather uses all sorts of information provided by the syntactic parse.

It differs from the FU approach in that:

a. It does not use any special notational devices other than those already provided by the programming language used;

b. It does not rely so completely on characteristics of verbs in the sentence (the so-called "functional information"), but rather uses all sorts of information provided by the syntactic parse.

It differs from both of the above approaches in that it performs the argument-filling after the syntactic parse has been completed. It uses a post-processor, and not the parsing component itself, to manipulate the full range of syntactic attribute-value information, in order to derive the most reasonable argument structure.

The deep argument names that the post-processor assigns, and their significances, are:

a. DSUBJECT--"deep" (or semantic) subject of the proposition; generally, the doer of an action

b. DOBJECT--"deep" (or semantic) object of the proposition; the entity that is most directly affected by the action of the doer

c. DINDOBJ--deep indirect object; the entity that experiences something, or receives something, through the action of the doer

d. DPREDNOM--the entity that is equated with the DSUBJECT in a proposition

e. DOBJCOMP--the entity that is equated with the DOBJECT in a proposition

With these structures, the following kinds of argument-filling problems are handled:

Missing arguments of infinitive clauses and participial clauses are assigned.

Displaced or "long-distance" arguments are assigned.

Missing or displaced arguments in passive constructions are assigned.

Arguments for the two different forms of the indirect object construction in English are equated.

In addition to the deep arguments, the following deep modifier names are used:

a. MODS--modifier; not further specified

b. NADJ--adjective premodifying a noun

c. PADJ--predicate adjective or adjective postmodifying a noun

d. OPS--operator; includes determiners and quantifiers

e. PARTICL--preposition or adverb that combines with a verb to signal a significant change in the argument structure of the verb phrase

f. PRED--basic form of each word

g. PROP--propositional modifier; may include infinitives and participial phrases

h. REF--the noun to which a pronoun refers
An additional difference between the present method and the methods of NLP systems that are motivated by linguistic theories is the fact that most of the latter systems currently use some form of unification, such as that provided by the logic programming languages. Unification allows for an automatic matching of attribute-value structures; but it has several drawbacks, such as its inability to deal elegantly with conditions of negation and disjunction. The present method, using a procedural post-processor, suffers no such drawbacks.
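As a sketch of why a procedural post-processor handles such conditions naturally, consider the following illustration of the general idea. The attribute-value record, the attribute names, and the function are hypothetical, invented for this example; they are not the actual rules of the disclosed system:

```python
# Hypothetical illustration: a procedural rule applied to an
# attribute-value record from a completed syntactic parse.
# Negation ("there is NO by-phrase") and disjunction ("infinitive
# OR participle") are plain boolean tests here, whereas they are
# awkward to express in pure unification.

def deep_subject(clause, matrix=None):
    """Return the deep (semantic) subject of a parsed clause."""
    # Negation: a passive clause with no by-phrase has an
    # unspecified deep subject, to be inferred later.
    if clause.get("voice") == "passive":
        return clause.get("by_phrase", "UNSPECIFIED")
    # Disjunction: infinitive or participial clauses borrow
    # their subject from the matrix (higher) clause.
    if clause.get("form") in ("infinitive", "participle") and matrix:
        return matrix.get("subject", "UNSPECIFIED")
    return clause.get("subject", "UNSPECIFIED")

# "Mary was kissed": passive, no by-phrase -> "UNSPECIFIED"
# "Mary was kissed by John": passive with by-phrase -> "John"
```

Because such tests are ordinary conditional statements in the host programming language, no special matching formalism is needed.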
The present method is highly efficient; the post-processor adds no measurable time to the operation of the system. In addition, because the initial parsing component is completely domain-independent, the entire system provides extremely broad coverage for English.
Although the EC approach and the FU approach dominate current linguistic theory, neither one has been widely adopted in applications that make use of NLP techniques today. Prior art applications that include a semantic analysis of English text generally make use of some form of lexically-driven argument identification, but do not necessarily embrace the techniques or formalisms of EC or FU.
A prior art method for semantic processing of English text is disclosed in the Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics, Stanford University, 6-9 Jul. 1987, pp. 131-134. The method disclosed therein is briefly explained below.
The prior art system is designed to handle a single semantic domain, namely, reports of failures in a specific type of machinery used on Navy ships. When an English sentence from this domain is inputted, the system makes a syntactic analysis of the sentence, and then maps the syntactic analysis onto an underlying format, or template, that specifies how many arguments can be related to the verb of that sentence, and what sorts of arguments those should be. Three different classes of arguments are defined: (1) obligatory, (2) essential, and (3) non-essential. Obligatory arguments must be present in the syntactic analysis, or the parse fails. Essential arguments need not be present in the syntax; but, if they are not, the system will hypothesize some "best guess" candidate to fill the role. Therefore both the essential and the obligatory arguments end up being present in the semantic structure of the sentence. Non-essential arguments may or may not be present.
For example, given the input sentence "Pump failed," the syntactic analysis should give "failed" as the main verb and "pump" as its syntactic subject. The underlying template for the verb "fail" should indicate that it has one argument, called the PATIENT. A mapping rule then suggests that "pump" is a good candidate for the PATIENT argument (arguments are also called "roles"). Next, restrictions are tested. For the verb "fail," there is a restriction saying that the filler of the PATIENT role must be a mechanical device. (In general, such information is carried by a feature--say, +MECH--that is marked on the dictionary entry for the noun "pump.") Since "pump" checks out as a mechanical device, the argument structure is completed: "failed" has one argument, its PATIENT, which is filled by "pump."
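The prior art mapping just described can be sketched roughly as follows. This is a simplified, hypothetical reconstruction: the template format, the feature names, and the function name are illustrative, not those of the actual prior art system:

```python
# Simplified, hypothetical sketch of lexically-driven role mapping.
# A verb template lists the verb's roles and the semantic feature
# each filler must carry; candidate fillers are checked against a
# feature lexicon before a role is assigned.

TEMPLATES = {"fail": {"PATIENT": "+MECH"}}   # "fail" takes one role
LEXICON = {"pump": {"+MECH"}}                # "pump" is a mechanical device

def map_roles(verb, candidates):
    """Map syntactic candidates onto the verb's semantic roles."""
    roles = {}
    for role, required in TEMPLATES.get(verb, {}).items():
        for word in candidates:
            if required in LEXICON.get(word, set()):
                roles[role] = word           # restriction satisfied
                break
    return roles

# "Pump failed": the syntactic subject fills the PATIENT role.
# map_roles("fail", ["pump"]) -> {"PATIENT": "pump"}
```

Note that a candidate lacking the required feature (e.g., "courage") simply fails the restriction and fills no role, which foreshadows the limitations discussed below.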
However, the prior art argument-filling method has several problems, as discussed below.
First, the possible meanings that words can have are severely limited, including only those that pertain to the domain in question. For example, the verb "fail" can have the meaning associated with sentences like:
The equipment failed,
in which it has one obligatory argument ("equipment"). But the system cannot interpret the verb "failed" in sentences like:
His courage failed him.
Today I took the chemistry exam and failed me a whopper!
The system counts on the fact that such sentences usually do not appear within the narrowly defined subdomain. But people use language in unpredictable ways; there is no guarantee that the verb "fail" would never be used, in Navy ship reports, with something like the meanings used above. The only way for the system to handle such sentences would be by means of additional templates for "fail." However, additional templates may cause much trouble for the syntactic analysis component.
Second, the process is complicated by the necessity to separate, for each verb, the three classes of arguments: obligatory, essential, and non-essential. The number of obligatory arguments varies with each different sense of a verb, and it is very difficult to specify precisely how many senses any given verb may have, even within a particular semantic subdomain.
Third, the flow of the system is hampered by the requirement that all essential arguments be filled, even if the filler is only a "best guess" hypothesis. In cases where fewer arguments are present in the syntactic structure than are required by the lists of obligatory and essential arguments, it is often necessary for the system to fail, back up, and try again, before achieving a successful parse for the sentence.
Fourth, in the prior art system, little or no attention is paid to the trickiest kinds of argument-filling, such as the "long-distance dependencies" discussed above. Again, the system counts on the fact that such complicated constructions are not expected to occur in narrow subdomains. Given the flexible nature of natural language, however, this is not a totally safe expectation.
The theoretical approaches to argument-filling discussed above (EC and FU) deal with the complexities of natural language, but their intrinsic complications make them difficult to use in practical applications. Prior art applications, although usable in the real world within semantic subdomains, do not provide techniques for dealing with the full complexity of natural language, and will therefore remain limited in their scope of application.
Reference is made to U.S. Pat. No. 4,731,735 to K. W. Borgendale, et al., assigned to IBM Corporation, entitled "Multilingual Processing for Screen Image Build and Command Decode in a Word Processor, With Full Command, Message and Help Support," for its disclosure of a data processing system in which the invention disclosed herein can be executed. The disclosure of the above cited patent is incorporated herein by reference to serve as a background for the invention disclosed herein.