Most of the semantic knowledge required in Natural Language Processing (NLP), and in related fields such as Artificial Intelligence (AI), has had to be built by hand, or hand-coded. Because the task of hand-coding semantic knowledge is time-consuming, these applications have necessarily been limited to a specific domain. In order to achieve true broad-coverage NLP, i.e., NLP unrestricted in domain, detailed semantic knowledge is required for tens or even hundreds of thousands of words, including those which are infrequent, technical, informal, slang, etc. Constructing such semantic knowledge by hand, as required in NLP and possibly AI, is a significant problem. The problem is: how to acquire the semantic knowledge required for an unrestricted domain.
There have been some attempts to hand-code highly structured semantic knowledge for unrestricted NLP: Dahlgren (1988); Lenat and Guha (1989); and Miller et al. (1990). These attempts all demonstrate that to construct a semantic knowledge base by hand is extremely difficult. While it may be relatively simple to make decisions about how to capture words representing concrete concepts, to adequately capture the meaning of more abstract words can be much more problematic, involving difficult and sometimes arbitrary decisions about what semantic properties of a concept might be relevant. Frequently, representing some problematic concept or word can force wholesale changes in the ontology or in the set of semantic features which are assumed.
There have been some attempts to build a knowledge base using statistical information to acquire the semantic properties of words from large corpora: Basili et al. (1992); Grefenstette (1992); Grishman and Sterling (1992); Hearst (1992); and Pustejovsky et al. (1993). Currently, however, none of these techniques appears capable of providing the semantic detail required for processing unrestricted text.
Our method is rooted in the tradition which attempts to construct a semantic knowledge base by identifying and extracting semantic information from a machine-readable version of a published dictionary (henceforth “on-line dictionary”). One of the earliest efforts in this general approach, which we will call dictionary-based, or DB, is Amsler (1980), which explored the possibility of constructing taxonomies (one type of semantic information) using computational methods. Although most of the ideas represented in this work were not actually implemented, this dissertation foreshadowed many of the issues which continue to confront researchers in computational lexicography. Chodorow et al. (1985) relied on string-matching to automatically extract genus terms for nouns and verbs from the on-line version of Webster's Seventh New Collegiate Dictionary (Webster 7). Markowitz et al. (1986) expanded on this general approach by attempting to discover “defining formulæ” or “significant recurring patterns” in the text of definitions—that is, syntactic or lexical patterns which appear to have been used in a consistent way by lexicographers to express a specific semantic relationship. In addition, Calzolari (1984, 1988) used string matching procedures in order to extract both genus and differentiæ information from the text of dictionary definitions.
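The string-matching approach to “defining formulæ” described above can be illustrated with a minimal sketch. The patterns, the relation names assigned to them, and the function name `extract_relation` are assumptions made for illustration; they are not the patterns used in any of the cited work.

```python
import re

# Hypothetical "defining formulae": surface patterns paired with the semantic
# relation they are assumed to signal. Illustrative only.
DEFINING_FORMULAE = [
    (re.compile(r"^the part of (?:a|an|the) (\w+)"), "Part_of"),
    (re.compile(r"^(?:a|an|the) (?:kind|type|sort) of (\w+)"), "Hypernym"),
    (re.compile(r"^(?:a|an|the) (\w+) (?:that|which|used)\b"), "Hypernym"),
]


def extract_relation(definition):
    """Return (relation, word) for the first formula that matches, else None."""
    text = definition.strip().lower()
    for pattern, relation in DEFINING_FORMULAE:
        match = pattern.search(text)
        if match:
            return relation, match.group(1)
    return None


if __name__ == "__main__":
    print(extract_relation(
        "the part of a plant, often beautiful and coloured, "
        "that produces seeds or fruit"
    ))
```

Patterns of this kind operate on the surface string only, so a definition whose genus term is preceded by modifiers, or whose structure departs from the expected formula, is simply missed.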
More recently, semantic information has been extracted from on-line dictionaries in a two-step procedure: first parsing the dictionary text (the definition and/or example sentences), and then applying patterns to this syntactic information in order to improve the accuracy of the identification of semantic information. The first work of this kind was Jensen and Binot (1987), which involved parsing dictionary definitions using the PLNLP Grammar and then searching the resulting parse trees for combinations of syntactic and lexical features which could be reliably associated with semantic relationships like Part_of and Instrument as well as genus terms. Jensen and Binot show how the results of this extraction procedure can effectively help resolve the kinds of prepositional phrase attachment ambiguities encountered in free text. Related work includes Klavans et al. (1990), Ravin (1990), Velardi et al. (1991) and Montemagni and Vanderwende (1992). Montemagni (1992), meanwhile, shows that this same general methodology can be used to acquire semantic information from on-line dictionaries of Italian.
An interesting aspect of the research program begun by Jensen and Binot (1987), and continued in Jensen's later writings (see, e.g., Chapter 17 of Tomita, ed. (1991)), is its claim that dictionary entries can be effectively analyzed by a parser designed for broad-coverage text analysis. (It will be understood that a parser is a software tool that takes a text string and produces a structure corresponding thereto.) In contrast, work such as Alshawi (1989) and Slator (in Wilks et al., 1989) has relied on specially-constructed parsers which exploit the idiosyncratic syntactic properties of LDOCE entries. The advantage of relying on a broad-coverage parser is that the parser need not be modified or rewritten in the course of extending the approach to other dictionaries. We see this as an important consideration, given that the huge semantic resources needed for broad-coverage NLP can only be acquired through the merging of multiple on-line dictionaries, as well as the analysis of encyclopedias and other sources.
While extracting semantic information from the parsed definitions and/or example sentences for words in the dictionary produces some semantic information for those words, the level of semantic information still is not sufficient for processing unrestricted text. There are some researchers who claim that dictionaries are too impoverished a source of semantic information to ever serve as the lexical knowledge base for sophisticated semantic processing (e.g. Atkins, Kegl, and Levin, 1986). This pessimistic view seems to be supported by a casual examination of dictionary entries. Definitions frequently fail to express even basic facts about word meanings, facts which we would obviously want to include in a knowledge base which is to serve as the basis for understanding language. A typical case is the word “flower” in the Longman Dictionary of Contemporary English (henceforth, LDOCE), whose primary sense is noteworthy more for the information it omits than for what it provides:
flower (n,1) “the part of a plant, often beautiful and coloured, that produces seeds or fruit”
Missing from this definition is any detailed description of the physical structure of flowers, information about what kinds of plants have flowers, and so on. Even the important fact that flowers prototypically have a pleasant scent goes unmentioned. We might, of course, try to increase our stock of information about this word's meaning by exploring the definitions of words used in its definition (“plant,” “beautiful,” etc.) in a way that is similar to the forward spreading activation in the networks of Veronis and Ide (1993). In this case, however, such a strategy is not especially productive, yielding general information about plants but no specific details about flowers. The question then remains: how to acquire the semantic knowledge required for an unrestricted domain.
To a great extent, the apparent inadequacy of on-line dictionaries for semantic processing can be attributed to the way in which they have been used, which we might term the forward-linking model of dictionary consultation. Given a dictionary in book form, the only way to find information about a given word involves looking it up, then exploring the semantic properties of any words mentioned in its definition, and so on. Once the data are available on-line, however, we can exploit dictionary access strategies which involve not only forward-linking, but also backward-linking. That is, in looking up a word we might consult not just its own definition, but also the definition of any word which mentions it. This approach was explored in Amsler (1980), who made use of concordances. For example, the dictionary entry for the word “flower” included a concordance of all of the words whose definitions mention “flower,” e.g. “petal” and “rose.” It is important to note, however, that a concordance of such terms does not make explicit the semantic relation, if any, which holds between the headword and a concordance term. As such, the presence of concordance terms for a headword does not augment the semantic information for that headword, nor does it facilitate any NL processing task, such as resolving syntactic ambiguity. Specifically, Amsler only explored discerning taxonomic information (hypernym, hyponym) from the concordances, and such information is of only limited use in NL processing. Amsler's concept was also cited by Chodorow et al. (1985) in developing a tool for helping human users disambiguate hyper/hyponym links among pairs of lexical items; again, however, this approach was limited to hypernym and hypernym_of semantic relations. Boguraev et al. (1989) discuss distributed lexical knowledge, in which the structure of each lexical entry is represented explicitly and the dictionary as a whole can be queried using a strategy of either query-by-example or unification.
However, in Boguraev et al. (1989), the information that can be queried is only that which can be conveyed by the structure of a lexical entry, not the contents of its definition and/or example sentences (they describe no method for extracting semantic information from those contents). Moreover, as described in Boguraev et al. (1989), the distributed lexical knowledge can only be discovered by constructing queries manually, and it is described as useful for the researcher who wants to acquire lexical information. This is in contrast to our system, which typically constructs a semantic knowledge base for consumption by a computer application and is only incidentally useful to a researcher. This knowledge base stores all of the relevant information pertaining to a lexical entry on that entry itself, so no query mechanism is required to bring together information which is found distributed throughout the on-line dictionary.
In accordance with the preferred embodiment of the present invention, a lexical knowledge base is compiled automatically from a machine-readable source, such as an on-line dictionary or unstructured text, obviating many of the drawbacks associated with the foregoing prior art techniques. The preferred embodiment of the invention makes use of “backward linking” by which inverse semantic relations are discerned and used to augment the knowledge obtained from traditional forward-linking analysis of the parsed text. Iteration of this technique can further enhance the results.
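The backward-linking idea can be sketched, under stated assumptions, as follows. The forward triples are hypothetical stand-ins for what might be extracted from the definitions of “flower,” “petal,” and “rose”; the `_of` naming convention for inverse relations and the names `inverse_name` and `build_knowledge_base` are likewise assumptions for illustration, not a description of the actual implementation.

```python
from collections import defaultdict

# Hypothetical forward-linked relations (headword, relation, value), as if
# extracted from parsed definitions. Illustrative data, not real output.
FORWARD = [
    ("flower", "Part_of", "plant"),
    ("petal", "Part_of", "flower"),
    ("rose", "Hypernym", "flower"),
]


def inverse_name(relation):
    """Assumed convention: 'X_of' and 'X' are inverses of one another."""
    return relation[:-3] if relation.endswith("_of") else relation + "_of"


def build_knowledge_base(triples):
    """Store each relation on its headword and its inverse on the value."""
    kb = defaultdict(set)
    for head, relation, value in triples:
        kb[head].add((relation, value))                # forward link
        kb[value].add((inverse_name(relation), head))  # backward link
    return kb


if __name__ == "__main__":
    kb = build_knowledge_base(FORWARD)
    for relation, word in sorted(kb["flower"]):
        print("flower", relation, word)
```

In this sketch the entry for “flower” gains labeled relations contributed by the definitions of “petal” and “rose,” in contrast to a bare concordance, which would list those words without naming the relation that holds.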
The foregoing and additional features and advantages of the present invention will be more readily apparent from the following detailed description thereof, which proceeds with reference to the accompanying drawings.