1. Field of the Invention
The present invention pertains to systems and methods for increasing the semantic and logical precision of information extracted from natural language text.
2. Description of the Related Art
Treebanks
Stand-off annotation is commonly used for annotating sentences with lexical and syntactic structure for use in the development of natural language processing (NLP) systems. Repositories of such annotation typically annotate words within sentences by part of speech and constituents of sentences as parse trees where leaves of such trees are words tagged with their part of speech and internal nodes of such trees are annotated with a classification of their syntactic function, such as any of various types of phrases or clauses. Consequently, such repositories of annotations are commonly called “treebanks”. The Penn Treebank, for example, annotates words with one of several dozen part of speech tags, phrases with one of a couple dozen phrase tags, and clauses with one of several clause tags.
Treebanks consist primarily of lexical and syntactic (i.e., word, phrase, clause) tags and syntactic structure (i.e., phrase and clausal structure as in parse trees).
Treebanks typically contain little or no semantic annotation.
Treebanks are typically constructed by linguists who use software tools to manually annotate natural language text with tags that specify lexical and syntactic details of fragments of the text, without regard to any particular grammar, NLP system, or technique.
Discriminatory Treebanking
In cases where a specific NLP technique is adopted for purposes of treebanking, a technique known as discriminatory treebanking may be employed. Discriminatory treebanking uses the output of an NLP system which implements the adopted technique in formulating a set of choices which discriminate between the set of parses obtained from the NLP system for any input fragment of text.
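The notion of a discriminant can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each parse is represented as a set of labelled spans; the parse identifiers, labels, and spans below are hypothetical and are not the output of any particular NLP system. Under that assumption, a discriminant is any property that holds in some but not all of the candidate parses, and each accept or reject decision narrows the candidate set.

```python
from collections import defaultdict

# Hypothetical candidate parses for "I saw the man with the telescope".
# Each parse is a set of (label, start, end) properties; all values are illustrative.
PARSES = {
    "p1": {("NP", 2, 4), ("PP", 4, 7), ("VP-attach", 1, 7)},
    "p2": {("NP", 2, 7), ("PP", 4, 7), ("NP-attach", 2, 7)},
    "p3": {("NP", 2, 4), ("PP", 4, 7), ("S-attach", 0, 7)},
}


def discriminants(parses):
    """Return the properties that hold in some, but not all, of the parses."""
    index = defaultdict(set)
    for parse_id, props in parses.items():
        for prop in props:
            index[prop].add(parse_id)
    return {prop: ids for prop, ids in index.items() if len(ids) < len(parses)}


def decide(parses, prop, accept):
    """Keep only the parses consistent with a yes/no decision on one discriminant."""
    return {pid: props for pid, props in parses.items() if (prop in props) == accept}


# ("PP", 4, 7) occurs in every parse, so it is not offered as a choice.
remaining = decide(PARSES, ("NP", 2, 4), True)            # leaves p1 and p3
remaining = decide(remaining, ("VP-attach", 1, 7), True)  # leaves p1
print(sorted(remaining))  # ['p1']
```

In this sketch two decisions suffice to isolate one of three parses; with realistic grammars the number of properties, and hence of choices, is far larger.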
The set of discriminants presented as choices in discriminatory treebanking may include any or all of the lexical, syntactic, and semantic information produced by the NLP system. The number of discriminants may be many times the number of words, phrases, or clauses in the input fragment of text, however. The number of lexical discriminants increases with the size of the vocabulary and the number of lexical features considered by the NLP system. The number of syntactic discriminants increases geometrically with the number of lexical discriminants as well as geometrically in the length of the input fragment of text and with the breadth of coverage or number of non-terminals in the grammar used in the NLP system. The number of semantic discriminants increases in proportion to the number of word senses or ontology concepts and how they are related to the lexical or syntactic terminals or non-terminals used in the NLP system.
Since the number of syntactic discriminants that would otherwise be presented for disambiguation is combinatoric in the size of the vocabulary and coverage of the grammar, prior approaches to discriminatory treebanking have been limited to small vocabulary applications with narrow-coverage grammars or fine-grained linguistic or semantic features with little or no discrimination among syntactic or semantic structural alternatives, such as phrase or clausal ambiguities.
Each of the prior approaches to discriminatory treebanking falls short of enabling large scale acquisition of knowledge from natural language. In the first case, limitations on vocabulary and grammatical coverage effectively preclude the use of such prior discriminatory treebanking approaches on uncontrolled natural language. In the second case, fine-grained features lacking discrimination among syntactic or semantic structural alternatives (e.g., parse trees or logical atoms) limit the potential user community to those educated in formal linguistics and familiar with the technical details of the underlying NLP system. Moreover, prior feature-based approaches commonly yield scores if not hundreds of discriminants which lack context due to their fine granularity, making discrimination a more error-prone process limited to expert users who effectively search for an acceptably discriminatory path through difficult choices among many alternatives.
Limitations of Discriminatory Treebanking
Discriminatory treebanking discriminates among the set of parses provided by an underlying NLP system with the objective of determining the correct interpretation of the input text. The process of discriminatory disambiguation begins with obtaining the full set of parses from the underlying NLP system. Unfortunately, for many natural texts, the ambiguity of sentences can produce hundreds or thousands of parses when a large vocabulary, broad coverage grammar is used, as required for uncontrolled natural language text. In the case of certain NLP systems which are not constrained by a grammar, such as statistical dependency parsers, the number of possible parses is exorbitant. Consequently, in each case, a limit on the number of parses considered is somewhat arbitrarily imposed. In the event that the correct parse is not ranked or otherwise returned within such a limited result set, discriminatory treebanking will fail. In practice, this limit is set fairly low for various reasons, including the time and space required by the NLP system to produce the parses and the time and space required to process the results into discriminants and render them in the discriminatory user interface. Without such limits, the number of choices presented would be overwhelming in many cases and possibly impractical in the case of NLP systems such as statistical dependency parsers.
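The scale of the problem can be illustrated with a rough Python sketch. It counts only the unlabelled binary bracketings of a sentence (a Catalan number), which is an illustrative assumption and not the number of parses any actual grammar would license; the function names are hypothetical. The second function simply states the failure condition described above: the desired parse is lost whenever its rank exceeds the imposed limit.

```python
from math import comb


def binary_bracketings(n_words):
    """Number of distinct binary trees over n_words leaves: the Catalan number C(n_words - 1)."""
    if n_words < 1:
        raise ValueError("a sentence needs at least one word")
    k = n_words - 1
    return comb(2 * k, k) // (k + 1)


def survives_cutoff(rank_of_desired_parse, limit):
    """True if the desired parse (ranked from 1) is inside the n-best list of size `limit`."""
    return rank_of_desired_parse <= limit


for n in (5, 10, 20):
    print(n, binary_bracketings(n))
# 5 14
# 10 4862
# 20 1767263190

print(survives_cutoff(rank_of_desired_parse=750, limit=500))  # False
```

Even before category labels are considered, a twenty-word sentence admits more than a billion bracketings in this sketch, which is why some cutoff is unavoidable and why a low cutoff risks excluding the desired parse.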
In the case of NLP systems that require lexical entries for words to exist in order to produce parses, unknown words or missing entries for certain parts of speech or argument structure are another common cause of failure in discriminatory treebanking. The set of possible parses will simply not contain parses for a word not previously known to have the part of speech or valence required by such an NLP system in order to produce the desired parse.
The parsers of most NLP systems are limited to accepting a total order of tokens from the input text. In natural text, however, there is ambiguity in how to segment text into tokens. Thus, another cause of discriminatory failure arises when the input sequence of tokens is not as required in order to produce the desired parse. In such cases, there is no discriminant for the overlooked token. Moreover, some approaches involve tagging tokens with a single part of speech before parsing. When the assigned part of speech is not as required in order to produce the desired parse, another case of discriminatory failure occurs.
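Segmentation ambiguity of this kind can be made concrete with a small Python sketch. The lexicon and the function name are hypothetical, and real tokenizers operate on far richer rules; the point is only that a single string may admit several token sequences, whereas a parser that accepts one total order of tokens sees only one of them.

```python
def segmentations(text, lexicon):
    """Enumerate every way to split `text` into consecutive tokens drawn from `lexicon`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical lexicon, chosen only to make the ambiguity visible.
LEXICON = {"can", "not", "cannot"}

print(segmentations("cannot", LEXICON))  # [['can', 'not'], ['cannot']]
```

If only the first sequence is passed to the parser and the desired parse requires the second, no discriminant for the overlooked token can ever be offered.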
Supporting the full generality of a partially ordered lattice of tokens, each with possibly many parts of speech, increases the negative effects of limiting the number of parses retrieved from NLP systems, increases the computational burdens of parsing such that on-line NLP becomes impractical, and makes the number of discriminants overwhelming for large vocabulary, broad coverage processing of natural language text.
Semantic Disambiguation
Treebanking tools are focused primarily upon annotating and, where supported by an NLP system, disambiguating the lexical and syntactic aspects of sentences. Treebanking tools largely ignore semantic and logical precision.
Treebanking tools that are supported by an NLP system may go further towards annotating some of the semantic and logical aspects of sentences, but their support for such aspects is limited to and by the underlying NLP system. For example, the lexical annotations are limited to the lexical entries of the underlying NLP system. Similarly, the semantic aspects are limited to the semantically and logically under-specified representations provided by some NLP systems, such as dependencies or minimal recursion semantics.
Underspecified Quantification
Parses resulting from NLP systems do not produce enough information to directly obtain a formal semantics of English. In particular, most NLP systems do not even consider quantification and few NLP systems handle quantification sufficiently for the purpose of obtaining axioms of formal logic from natural language.
Lexicalized grammars, in which richer descriptions of lexical entries (e.g., words) are processed by the underlying NLP system, typically produce some semantic information, including underspecified quantification. Probabilistic context-free grammars resulting from statistical NLP produce phrase structure (i.e., parse) trees, but no information about logical quantifiers or their scope. Additional processing of parse trees may provide dependency structures, as produced by dependency grammars. Dependency grammars typically provide little or no direct information concerning quantification, but may be further processed to produce some constraints on the relative scope of quantifiers. In all cases, however, the resulting logic remains underspecified with regard to quantification, both in terms of the quantifier and its scope.
Quantifier Ambiguity
Most quantifications resulting from NLP are underspecified in several aspects. From the standpoint of classical logic, the most significant ambiguity is whether a quantification is existential or universal in nature. There are many other aspects of quantification as it arises in natural language, however, and most of them are typically overlooked. One such aspect is whether the universal should be interpreted as being implicative with or without implied existential quantification of its restricting or binding formulation. Another aspect is whether the quantifier is first- or higher-order over individuals, types, or sets. Another aspect is whether the quantifier is singular or plural. Another aspect is polarity, such as negative with respect to universal or existential. Another aspect is whether the quantifier is defeasible. For example, generalized quantifiers for determiners such as ‘most’ or ‘typical’ may be treated as universal quantification in defeasible logic. Similarly, generalized quantifiers for ‘few’ or quantifiers for ontological promiscuity introduced for adverbials or adjectives, such as ‘rarely’ or ‘unusual’, may be treated as negations applied to defeasible quantifiers, such as corresponding to the negation of a defeasible existential or universal.
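The difference between a universal with and without implied existential quantification of its restrictor can be shown with a short Python sketch over a finite domain. The domain, predicates, and function names are hypothetical and serve only to illustrate the distinction; defeasible and higher-order quantifiers are not modelled.

```python
def existential(restrictor, body, domain):
    """Some x satisfies both the restrictor and the body."""
    return any(restrictor(x) and body(x) for x in domain)


def universal(restrictor, body, domain):
    """Purely implicative reading: every x that satisfies the restrictor satisfies the body."""
    return all(body(x) for x in domain if restrictor(x))


def universal_with_import(restrictor, body, domain):
    """Implicative reading plus the implied claim that the restrictor is non-empty."""
    return any(restrictor(x) for x in domain) and universal(restrictor, body, domain)


# Hypothetical model with no unicorns: "Every unicorn is white."
domain = {"rex", "tom"}
is_unicorn = lambda x: False
is_white = lambda x: x == "rex"

print(universal(is_unicorn, is_white, domain))              # True (vacuously)
print(universal_with_import(is_unicorn, is_white, domain))  # False
print(existential(is_unicorn, is_white, domain))            # False
```

The same sentence is therefore true under one reading and false under the other, so the choice must be made before axioms of formal logic can be obtained.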
Scope Ambiguity
Even after complete disambiguation, the logical ambiguity of most sentences is extreme. Considering only noun phrases, for example, there are as many as N factorial (N!) “readings” corresponding to the number of permutations of generalized quantifiers introduced for each noun phrase. In practice, the number of linguistically plausible logical readings is smaller but nonetheless explosive. Logical reasoning with knowledge “contained” in text requires that the disjunction of readings resulting from such NLP be effectively eliminated or that means of reasoning more directly with underspecified representation be developed.
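The factorial growth in readings can be sketched in Python by enumerating the orderings of the quantifiers introduced by the noun phrases of a sentence. The example sentence, the bracket notation, and the function names are illustrative assumptions; no linguistic plausibility filtering is attempted.

```python
from itertools import permutations
from math import factorial


def scope_orderings(quantified_nps):
    """Every relative scope ordering of the given (quantifier, variable) pairs."""
    return list(permutations(quantified_nps))


def render(ordering, body):
    """Write one reading as a prefix of scoped quantifiers followed by the body."""
    prefix = " ".join(f"[{q} {v}]" for q, v in ordering)
    return f"{prefix} {body}"


# Hypothetical analysis of "Every student read some book in a library."
nps = [("every", "x:student"), ("some", "y:book"), ("a", "z:library")]
readings = scope_orderings(nps)

assert len(readings) == factorial(len(nps))  # 3! = 6
for reading in readings:
    print(render(reading, "read(x, y) & in(y, z)"))
```

With three noun phrases there are six orderings; with six noun phrases there would be 720, before any implicit event variables or elided arguments are counted.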
Noun phrases are not the only source of scope ambiguities, however. Additional ambiguities arise, for example, with regard to the implicit events or situations of Davidsonian representation. Typically, Davidsonian representation of circumstances involves an implicit variable for each verb and prepositional phrase, for example. Furthermore, ellipsis frequently results in arguments of valent lexemes (especially transitive verbs, but also including transitive adjectives and nouns) being omitted from parses returned by NLP. Such elided arguments, although missing from the grammatical analysis of NLP, are typically required in axioms of formal logic using Davidsonian predicates or certain semantic roles of neo-Davidsonian representation. Each of such arguments to predicates in the axioms of formal logic may contribute an additional quantifier, thereby further exploding the logical ambiguity of even fully disambiguated parses (e.g., the final products of treebanking).
Reference Ambiguity
In under-specified semantics resulting from NLP, predicate-argument structures may involve arguments that logically refer to another, such as in the case of pronouns, definite references, and other forms of anaphora. In other cases, a predicate may logically require an argument which does not occur within the text, such as omitting the subject of a passive verb and other forms of ellipsis. Either ellipsis or anaphora may require co-reference resolution, which may involve logical equality or inequality.
Connective Ambiguities
Depending on the underlying NLP system, the logical semantics of linguistic conjunctions, such as coordinating conjunctions, is frequently not explicit in the parse information returned. That is, even after unambiguous treebanking, the logical semantics of a conjunction may remain logically or semantically ambiguous. Such ambiguity may be structural (i.e., concerning which constituents belong to the conjunction) or semantic. Grammatical disambiguation typically resolves most structural ambiguity while typically leaving semantic ambiguities unresolved. For example, the scope of an ‘and’ may be clearly specified in a treebank, but whether the semantics is set theoretical union or disjunction (i.e., collective or distributive) typically remains underspecified. Similarly, the scope of an ‘or’ may be clarified during grammatical disambiguation, but whether its semantics is inclusive or exclusive typically remains underspecified.
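The semantic ambiguities that remain after structural disambiguation can be illustrated with a small Python sketch. The model (a single record of a group lifting a piano) and all function names are hypothetical; the sketch contrasts only the collective and distributive readings of ‘and’ and the inclusive and exclusive readings of ‘or’.

```python
# Hypothetical model for "Ann and Bob lifted the piano":
# the only lifting on record was performed by the two of them together.
LIFTED = {frozenset({"ann", "bob"})}


def lifted(group):
    return group in LIFTED


def distributive_and(members, holds_of):
    """The predicate holds of each member individually."""
    return all(holds_of(frozenset({m})) for m in members)


def collective_and(members, holds_of):
    """The predicate holds of the members taken together as one group."""
    return holds_of(frozenset(members))


def inclusive_or(a, b):
    return a or b


def exclusive_or(a, b):
    return a != b


print(distributive_and(["ann", "bob"], lifted))  # False
print(collective_and(["ann", "bob"], lifted))    # True
print(inclusive_or(True, True), exclusive_or(True, True))  # True False
```

The bracketing of the conjunction is identical in both readings of each connective, which is why a treebank that records only structure cannot distinguish them.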
Even when the structural ambiguity of connectives is resolved, whether a quantifier is within the scope of a connective versus the connective being within the scope of a quantifier typically remains unresolved after treebanking. The relative scope of quantifiers with regard to connectives is particularly troublesome and important with regard to negation (a degenerate, i.e., unary, logical connective).
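The interaction between negation and a quantifier can be sketched in Python for a sentence such as "Every student did not pass". The domain, the set of students who passed, and the function names are hypothetical; the sketch shows only that the two relative scopes yield different truth values in the same model.

```python
def every_not(domain, pred):
    """Quantifier outscopes negation: for every x, not pred(x)."""
    return all(not pred(x) for x in domain)


def not_every(domain, pred):
    """Negation outscopes quantifier: it is not the case that pred(x) holds for every x."""
    return not all(pred(x) for x in domain)


# Hypothetical model: Ann passed, Bob did not.
students = {"ann", "bob"}
passed = {"ann"}


def did_pass(x):
    return x in passed


print(every_not(students, did_pass))  # False: Ann passed
print(not_every(students, did_pass))  # True: Bob did not pass
```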
Grammatical Complexity
General purpose, broad-coverage NLP systems use a variety of grammatical formalisms and techniques to parse natural language. The simplest context-free grammars (CFG) model language with a small set of non-terminals corresponding to basic parts of speech and types of phrases and clauses. The most basic parts of speech include nouns, verbs, adjectives, adverbs, conjunctions, prepositions, and articles. The most basic types of phrases include noun phrases, verb phrases, and prepositional phrases. And the most basic types of clauses include non-finite clauses, relative clauses, and sentences. The parses resulting from NLP using a CFG correspond to tree structures where the leaves are parts of speech classifying the input lexemes by part of speech, internal nodes correspond to phrases or clauses, and the root node corresponds to the input sentence.
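A toy example may help make the CFG description concrete. The Python sketch below counts the parse trees of a sentence with a CYK-style chart; the grammar (already in binary form), the lexicon, and the function name are illustrative assumptions and do not correspond to any particular NLP system.

```python
from collections import defaultdict

# Hypothetical toy grammar: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]

# Hypothetical lexicon mapping each word to its possible categories.
LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "with": ["P"], "telescope": ["N"],
}


def count_parses(words, root="S"):
    """Count the distinct parse trees rooted in `root` that span all of `words`."""
    n = len(words)
    chart = defaultdict(int)  # (start, end, label) -> number of subtrees
    for i, word in enumerate(words):
        for tag in LEXICON.get(word, []):
            chart[(i, i + 1, tag)] += 1
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    chart[(start, end, parent)] += (
                        chart[(start, mid, left)] * chart[(mid, end, right)]
                    )
    return chart[(0, n, root)]


print(count_parses("I saw the man with the telescope".split()))  # 2
print(count_parses("I saw the unicorn".split()))                  # 0
```

The two trees correspond to attaching the prepositional phrase to the verb phrase or to the noun phrase. The second call returns zero because the word is missing from the lexicon, so no parse can be produced for it at all.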
CFG lack any semantic constraints, such as agreement between the person, number, gender, tense, aspect, mood and other semantics of natural language and logic. Consequently, the parses resulting from NLP using CFG include parses that make no sense, semantically or logically speaking. As a result, many NLP systems eschew or extend CFG in order to avoid results that make no sense. Alternatives to CFG include various formalisms, most of which may be classified as being dependency- or constituency-based formalisms.
Constituency-based formalisms typically produce parses in which a sentence comprises constituent phrases or clauses which typically comprise lexemes, phrases, or clauses. Sentence structures as traditionally depicted in elementary school are constituency-based in that each node spans part of the sentence without gaps. Dependency-based formalisms typically produce parses in which parts of the sentence are related to one another but not necessarily in a way that corresponds to a span of the sentence. Constituency-based formalisms tend to be grammars crafted by linguists while dependency grammars are common in statistical NLP systems. Constituency-based formalisms have been the basis for prior approaches to discriminatory treebanking.
CFG is a constituency-based formalism that lacks semantic constraints, as discussed above. A common technique to extend CFG is to “lexicalize” the grammar such that lexical entries for words are described with more semantics and the production rules of CFG are augmented with constraints between the non-terminals which occur in such rules that are “unified” during parsing. Such lexicalized or unification grammars and processing mechanisms include lexical functional grammar (LFG), head-driven phrase structure grammar (HPSG), and combinatory categorial grammar (CCG). Parses which result from such grammars include semantic information propagated through unification from lexical entries through the production rules of the grammar to the resulting constituents of such parses. In some such NLP systems, the unification of lexical entries and grammar rules augments parse results with predicate-argument structures corresponding to parts of formulas in formal logic. In one family of HPSG systems, the predicate-argument structures resulting from NLP use a formalism known as “minimal-recursion semantics”. In another family of CCG systems, the predicate-argument structures resulting from NLP use a formalism known as “hole semantics”. Such predicate-argument structure formalisms are known as under-specified semantic representations. Such representations are under-specified in that they do not resolve the logical quantification of variables or any equalities or inequalities between such variables occurring in their predicate-argument structures (as discussed above with regard to generalized quantifiers, scope, anaphora, and co-reference resolution).
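The role of unification in enforcing constraints such as agreement can be sketched in Python. The sketch assumes feature structures are plain nested dictionaries and omits the shared (re-entrant) values that actual unification grammars rely on; the feature names, lexical entries, and function name are hypothetical.

```python
def unify(a, b):
    """Unify two feature structures given as nested dicts; return None on a clash."""
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val
            continue
        a_val = result[key]
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            sub = unify(a_val, b_val)
            if sub is None:
                return None
            result[key] = sub
        elif a_val != b_val:
            return None
    return result


# Hypothetical entries: what "barks" requires of its subject, and two candidate nouns.
barks_subject = {"cat": "N", "agr": {"num": "sg", "per": 3}}
dog = {"cat": "N", "agr": {"num": "sg"}}
dogs = {"cat": "N", "agr": {"num": "pl"}}

print(unify(dog, barks_subject))   # {'cat': 'N', 'agr': {'num': 'sg', 'per': 3}}
print(unify(dogs, barks_subject))  # None
```

A plain CFG with only the categories N and V would accept both combinations; the unification failure is what rules out the second.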
Lexicalized dependency grammars are prevalent in machine learning approaches to NLP. Unlike constituency grammars, which are typically crafted by linguists, machine learning approaches to NLP generally induce a probabilistic context-free grammar or dependency grammar given a vocabulary of lexical entries and, in many cases, a database of “gold parses” as in a treebank. Parses resulting from dependency-based NLP systems lack the non-terminal labels corresponding to production rules of a constituency-based grammar. The results of such NLP formalisms lack the predicate-argument structures discussed above, although various algorithms have been developed to produce constituency and predicate-argument structures from the results of dependency-based NLP.