1. Field of the Invention
The invention relates to computerized text analysis generally and more specifically to the problem of determining whether a given word-sense pair is proper for a given context.
2. Description of the Prior Art
Machine translation of natural language texts has long been a goal of researchers in computer science and linguistics. A major barrier to high-quality machine translation has been the difficulty of disambiguating words. Word disambiguation is necessary because many words in any natural language have more than one sense. For example, the English noun sentence has two senses in common usage: one relating to grammar, where a sentence is a part of a text or speech, and one relating to punishment, where a sentence is a punishment imposed for a crime. Human beings use the context in which the word appears and their general knowledge of the world to determine which sense is meant, and consequently do not even have trouble with texts such as:
The teacher gave the student the sentence of writing the sentence "I will not throw spit wads" 100 times.
Computers, however, have no general knowledge of the world, and consequently have had a great deal of trouble translating sentences such as the above into languages such as French, where the word used to translate sentence when it is employed in the grammatical sense is phrase and the word used to translate sentence when it is employed in the sense of punishment is peine.
The ability to determine a probable sense of a word from the context in which the word is used is important in other areas of text analysis as well. For example, optical character recognition systems and speech recognition systems often can only resolve a printed or spoken word into a small set of possibilities; one way of making a choice among the words in the small set is to determine which word has a sense which best fits the context. Other examples in this area are determining whether characters such as accents or umlauts should be present on a word or whether the word should be capitalized. Additionally, there are text editing tools such as spelling checkers or interactive thesauri which present the user with a set of suggested alternatives for a word. These tools, too, are improved if the set of alternatives is limited to words whose senses fit the context.
Another area of text analysis that will benefit from good techniques for determining the probable sense of a word from its context is data base searching. Word searches in data bases work by simply matching a search term with an occurrence of the term in the data base, without regard to the sense in which the term is used in the data base. The only way to restrict a search to a given sense of a term is to provide other search terms which the searcher expects to find in conjunction with the first search term. Such a search strategy will, however, miss occurrences of the first term where the first term has the proper sense but is not found in conjunction with the other search terms. Given a useful way of determining what sense of a word best fits a context, it will be possible to search by specifying not only the search term, but also the sense in which it is being used.
Past researchers have used three different general approaches to the word disambiguation problem sketched above:
1. Qualitative Methods, e.g., Hirst (1987)
2. Dictionary-based Methods, e.g., Lesk (1986)
3. Corpus-based Methods, e.g., Kelly and Stone (1975)
In each case, the work has been limited by a knowledge acquisition bottleneck. For example, there has been a tradition in parts of the AI community of building large experts by hand, e.g., Granger (1977), Rieger (1977), Small and Rieger (1982), Hirst (1987). Unfortunately, this approach is not very easy to scale up, as many researchers have observed:
"The expert for THROW is currently six pages long, . . . but it should be 10 times that size." (Small and Rieger, 198X)
Since this approach is so difficult to scale up, much of the work has had to focus on "toy" domains (e.g., Winograd's Blocks World) or sublanguages (e.g., Isabelle (1984), Hirschman (1986)). Currently, it is not possible to find a semantic network with the kind of broad coverage that would be required for unrestricted text.
Others such as Lesk (1986), Walker (1987), Ide (1990, Waterloo Meeting) have turned to machine-readable dictionaries (MRDs) such as Oxford's Advanced Learner's Dictionary of Current English (OALDCE) in the hope that MRDs might provide a way out of the knowledge acquisition bottleneck. These researchers seek to develop a program that could read an arbitrary text and tag each word in the text with a pointer to a particular sense number in a particular dictionary. Thus, for example, if Lesk's program were given the phrase pine cone, it ought to tag pine with a pointer to the first sense under pine in OALDCE (a kind of evergreen tree), and it ought to tag cone with a pointer to the third sense under cone in OALDCE (fruit of certain evergreen trees). Lesk's program accomplishes this task by looking for overlaps between the words in the definition and words in the text "near" the ambiguous word.
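By way of illustration only, the definition-overlap idea may be sketched as follows. The sketch is not Lesk's program: the glosses in TOY_GLOSSES are invented stand-ins rather than OALDCE text, and the names STOPWORDS, content_words and best_sense, the exact-match word comparison, and the tie-breaking rule are all assumptions made for the example.

```python
# Simplified definition-overlap scoring in the spirit of Lesk (1986).
# Glosses are invented for illustration; they are NOT taken from OALDCE.
STOPWORDS = {"a", "an", "the", "of", "or", "in", "to", "from", "that", "this"}

TOY_GLOSSES = {
    "pine": {
        1: "kind of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness",
    },
    "cone": {
        1: "solid body that narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees",
    },
}

def content_words(text):
    """Lowercase the text, strip simple punctuation, and drop stopwords."""
    words = (w.strip(".,;:\"'").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}

def best_sense(word, text, glosses=TOY_GLOSSES):
    """Return the sense number whose gloss shares the most words with the text."""
    context = content_words(text) - {word}
    scores = {
        sense: len(content_words(gloss) & context)
        for sense, gloss in glosses[word].items()
    }
    # Highest overlap wins; ties go to the lowest sense number (an assumption).
    return min(scores, key=lambda s: (-scores[s], s))

text = "The pine cone fell from the evergreen tree"
print(best_sense("pine", text))  # sense 1: gloss shares "evergreen" and "tree"
print(best_sense("cone", text))  # sense 3: gloss shares "evergreen"
```

With the bare phrase pine cone as the only context, this sketch finds no overlapping words and falls back to the first sense of each word, which hints at how sensitive the technique is to the length of the definitions and of the context.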
Unfortunately, the approach doesn't seem to work as well as one might hope. Lesk (1986) reports accuracies of 50-70% on short samples of Pride and Prejudice. Part of the problem may be that dictionary definitions are too short to mention all of the collocations (words that are often found in the context of a particular sense of a polysemous word). In addition, dictionaries have much less coverage than one might have expected. Walker (1987) reports that perhaps half of the words occurring in a new text cannot be related to a dictionary entry.
Thus, like the AI approach, the dictionary-based approach is also limited by the knowledge acquisition bottleneck; dictionaries simply don't record enough of the relevant information, and much of the information that is stored in the dictionary is not in a format that computers can easily digest, at least at present.
A third line of research makes use of hand-annotated corpora. Most of these studies are limited by the availability of hand-annotated text. Since it is unlikely that such text will be available in large quantities for most of the polysemous words in the vocabulary, there are serious questions about how such an approach could be scaled up to handle unrestricted text. Kelly and Stone (1975) built 1815 disambiguation models by hand, selecting words with a frequency of at least 20 in a half million word corpus. They started from key word in context (KWIC) concordances for each word, and used these to establish the senses they perceived as useful for content analysis. The models consisted of an ordered set of rules, each giving a sufficient condition for deciding on one classification, or for jumping to another rule in the same model, or for jumping to a rule in the model for another word. The conditions of a given rule could refer to the context within four words of the target word. They could test the morphology of the target word, an exact context word, or the part of speech or semantic class of any of the context words. The sixteen semantic classes were assigned by hand.
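An ordered rule model of the general kind described above may be sketched as follows. The sketch is an illustration under stated assumptions, not a reconstruction of Kelly and Stone's models: the Rule structure, the helper names, the sense labels, and the sample rules for sentence are all hypothetical, and only exact context words are tested (part of speech and semantic class tests are omitted).

```python
from dataclasses import dataclass
from typing import Callable, Optional

WINDOW = 4  # rules may inspect context within four words of the target word

@dataclass
class Rule:
    test: Callable[[list, int], bool]  # (tokens, target index) -> does the rule fire?
    sense: Optional[str] = None        # classification returned if the rule fires
    goto: Optional[tuple] = None       # or (model name, rule index) to jump to

def window(tokens, i):
    """Words within WINDOW positions of index i, excluding the target itself."""
    return tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]

def has_context(*words):
    """Build a test that fires when any of the given words is in the window."""
    return lambda tokens, i: any(w in window(tokens, i) for w in words)

def always(tokens, i):
    return True

# Hypothetical models; the rules and sense labels are invented for this example.
MODELS = {
    "sentence": [
        Rule(has_context("judge", "prison", "crime"), sense="punishment"),
        Rule(has_context("word", "grammar", "write"), sense="grammar"),
        Rule(always, sense="grammar"),  # default when nothing else fires
    ],
    # The plural form simply jumps to the first rule of the singular model.
    "sentences": [Rule(always, goto=("sentence", 0))],
}

def classify(word, tokens, i, models=MODELS, max_steps=100):
    """Walk the rules in order until one yields a classification."""
    model, r = word, 0
    for _ in range(max_steps):  # guard against cyclic jumps
        rules = models[model]
        if r >= len(rules):
            return None
        rule = rules[r]
        if rule.test(tokens, i):
            if rule.sense is not None:
                return rule.sense
            model, r = rule.goto  # jump within this model or to another word's model
        else:
            r += 1  # fall through to the next rule
    return None

print(classify("sentence", "the judge gave him a sentence of five years".split(), 5))
print(classify("sentences", "long sentences for the crime".split(), 1))
```

Even in this reduced form, every rule and every trigger word must be supplied by hand, which is the labor the subsequent work sought to avoid.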
Most subsequent work has sought automatic methods because it is quite labor intensive to construct these rules by hand. Weiss (1973) first built rule sets by hand for five words, then developed automatic procedures for building similar rule sets, which he applied to an additional three words. Unfortunately, the system was tested on the training set, so it is difficult to know how well it actually worked.
Black (1987, 1988) studied five 4-way polysemous words using about 2000 hand-tagged concordance lines for each word. Using 1500 training examples for each word, his program constructed decision trees based on the presence or absence of 81 "contextual categories" within the context of the ambiguous word. He used three different types of contextual categories: (1) subject categories from LDOCE, the Longman Dictionary of Contemporary English (Longman, 1978), (2) the 41 vocabulary items occurring most frequently within two words of the ambiguous word, and (3) the 40 vocabulary items excluding function words occurring most frequently in the concordance line. Black found that the dictionary categories produced the weakest performance (47 percent correct), while the other two were quite close at 72 and 75 percent correct, respectively.
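The second type of contextual category, vocabulary items occurring most frequently within two words of the ambiguous word, may be illustrated with the following sketch. It covers only the selection of categories and the presence/absence encoding, not the construction of decision trees, and it is not Black's program: the function names, the whitespace tokenization, and the three sample concordance lines are assumptions made for the example.

```python
from collections import Counter

def near_words(tokens, target, radius=2):
    """Yield the words within `radius` positions of each occurrence of `target`."""
    for i, tok in enumerate(tokens):
        if tok == target:
            yield from tokens[max(0, i - radius):i]
            yield from tokens[i + 1:i + 1 + radius]

def top_categories(lines, target, k, radius=2):
    """Return the k vocabulary items seen most often near the target word."""
    counts = Counter()
    for line in lines:
        counts.update(near_words(line.lower().split(), target, radius))
    return [word for word, _ in counts.most_common(k)]

def feature_vector(line, target, categories, radius=2):
    """Encode presence (1) or absence (0) of each category near the target."""
    present = set(near_words(line.lower().split(), target, radius))
    return [int(c in present) for c in categories]

# Invented concordance lines, for illustration only.
LINES = [
    "the judge passed sentence on the thief",
    "the judge reduced the sentence on appeal",
    "parse the sentence on the page",
]

categories = top_categories(LINES, "sentence", k=2)
print(categories)
print(feature_vector("the sentence was long", "sentence", categories))
```

On so small a sample the most frequent neighbors are function words; the third type of category described above excludes such words.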
There has recently been a flurry of interest in approaches based on hand-annotated corpora. Hearst (1991) is a very recent example of an approach somewhat like those of Black (1987, 1988), Weiss (1973) and Kelly and Stone (1975) in this respect, though she makes use of considerably more syntactic information than the others did. Her performance also seems to be somewhat better than the others', though it is difficult to compare performance across systems.
As may be seen from the foregoing, the lack of suitable techniques for determining which word-sense pair best fits a given context has been a serious hindrance in many areas of text analysis. It is an object of the apparatus and methods disclosed herein to provide such techniques.