The invention relates to techniques that semantically disambiguate words using rules, referred to herein as “semantic disambiguation rules” or simply “disambiguation rules”.
Segond, F., Aimelet, E., and Jean, C. previously developed semantic dictionary look-up (SDL), a technique that uses dictionary information about subcategorization and collocates to disambiguate word sense. SDL uses a dictionary, specifically the Oxford University Press-Hachette bilingual French-English, English-French dictionary (OUP-H), as a semantically tagged corpus of different languages. SDL selects the most appropriate translation of a word appearing in a given context, and reorders dictionary entries making use of dictionary information.
To extract functional information from input text in order to match against OUP-H information, SDL uses an incremental finite state parser. The parser adds syntactic information in an incremental way, depending on the contextual information available. SDL matches relations extracted by the parser against collocates in the OUP-H, and, if a match is found, SDL reorders the dictionary entry to propose the OUP-H translation that includes the matching collocate rather than the first sense in the OUP-H. In case of information conflict between subcategorization and collocates, SDL gives priority to collocates.
Dini, L., DiTomaso, V., and Segond, F. also previously developed Ginger II, a semantic tagger that performs “all word” unsupervised word sense disambiguation for English. To automatically generate a large, dictionary-specific semantically tagged corpus, Ginger II extracts example phrases found in the text in machine-readable dictionary entries from the HECTOR dictionary described in Atkins, S., “Tools for corpus-aided lexicography: the HECTOR project”, Acta Linguistica Hungarica, Budapest, Vol. 41, 1992-93, pp. 5-72. Ginger II attaches to each headword in this text the dictionary sense numbering in which the text was found. This provides the sense label for the headword in that context. Ginger II then builds a database of semantic disambiguation rules from this labeled text by extracting functional relations between words in these corpus sentences.
The rules can be on two layers: a word layer and/or an ambiguity class layer. The rules are extracted directly with a nonstatistical approach, using all functional relations that are found. When an example z is listed under the sense number x of a dictionary entry for the word y, Ginger II creates a rule that, in usages similar to z, the word y has the meaning x. A HECTOR sense number is used to represent the headword in a rule based on an example in the dictionary entry for that sense, while WordNet tags are used for other words in the examples. One type of rule indicates, for a specified ambiguity class, that it disambiguates as a specified one of its members when it has a specified functional relation to a specified word. Another type of rule indicates, for a specified ambiguity class, that it disambiguates as a specified one of its members when it has a specified functional relation to a specified ambiguity class.
Ginger II applies the rules to a new input text to obtain as output a semantically tagged text, giving word layer rules priority over class layer rules. If more than one rule from the same layer matches, the applier uses the notion of tagset distance in order to determine the best matching rule. The metric for computing the distance can be set by the user and can vary across applications.
The invention addresses problems that arise with the previous techniques of Segond et al. and Dini et al., described above.
The SDL technique of Segond et al. uses information from a dictionary to disambiguate word senses, but depends on finding a very precise match between a relation in input text and a collocate in a dictionary example. After obtaining information about a relation from the parser, SDL accesses a dictionary entry for a word in the relation to determine whether a matching collocate occurs in one of the senses in the entry. If a precise match occurs in one of the senses, SDL selects that sense for the word. SDL will not, however, obtain any information from a collocate that does not precisely match the input text relation. This problem, referred to herein as the “precise match problem”, reduces the ability of SDL to disambiguate words in contexts that are similar but not identical to collocates.
Ginger II of Dini et al. employs disambiguation rules that include ambiguity classes. Ginger II therefore alleviates the precise match problem, because a rule with an ambiguity class may be applicable when contexts do not precisely match. But because Ginger II can produce a very large number of rules from a detailed dictionary, there are often two or more disambiguation rules at the word layer or at the class layer that match a functional relation in an input text. When this occurs, Ginger II computes a tagset distance to determine the best matching rule. In practice, however, the tagset distance technique sometimes fails to select the best rule, and instead selects a rule that produces incorrect disambiguation. This problem is referred to herein as the “incorrect rule problem”, and it is likely to become more serious as more dictionary information is used to produce rules, because more rules will result.
The precise match problem and the incorrect rule problem appear to be in tension: The use of ambiguity classes to alleviate the precise match problem, as in Ginger II, would make the incorrect rule problem worse. On the other hand, limiting the number of possible matches to avoid the incorrect rule problem, as the SDL technique implicitly does, would lead to the precise match problem.
The invention is based on the discovery of techniques that can alleviate both the precise match problem and the incorrect rule problem. The techniques can be used with ambiguity classes, thus alleviating the precise match problem; the techniques also provide flexible ways to select rules, making it possible to alleviate the incorrect rule problem. The techniques, as implemented, employ more dictionary information than Ginger II, thus obtaining more rules, but can nevertheless select rules without difficulty.
The techniques use disambiguation rules derived from different types of information in a corpus, and select one rule rather than another based on the types of corpus information from which the rules are derived. A detailed corpus such as a dictionary typically contains several different types of information, and rules obtained from some types of information are more likely to disambiguate a word correctly than rules obtained from other types. Because of the additional precision implicit in the rule types, the techniques sometimes lead to better rule selection than purely distance-based techniques.
In addition, the rules can include both word-based rules that specify a relation between specified words and class-based rules that specify a relation between a word and a class or between two classes. Where both a word-based rule and a class-based rule match a text, disambiguation can sometimes be improved by selecting the word-based rule rather than the class-based rule. When a text is matched by more than one rule at the same level of specificity, whether word-based or class-based, a rule can be selected based on information type.
Therefore, the invention alleviates the precise match problem by permitting class-based rules and also alleviates the incorrect rule problem by permitting selection of a disambiguation rule based on type of information. Already, the techniques can sometimes obtain better disambiguation results than conventional distance-based selection. Because selection of rules based on information type is more flexible than distance-based selection, the techniques offer the possibility of further improvements in disambiguation results.
The techniques can be implemented in a method that obtains information about a context in which a semantically ambiguous word occurs in an input text. A first rule derived from a first type of information in a corpus and a second rule derived from a second type of information in the corpus are both applicable to words occurring in the context. Based on the types of corpus information from which the rules are derived, the method selects the first rule rather than the second rule to disambiguate the semantically ambiguous word.
As noted above, rules can be derived from different types of information in a dictionary. For example, in one successful implementation, the dictionary includes a set of types of information that includes, from highest to lowest priority, collocates, idioms (using word-based rules only), compounds, structure examples, phrasal verb examples, usage, and general examples. This prioritization is based in part on the observation that selecting collocates over other types of dictionary information produces more reliable results.
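The priority ordering just listed can be illustrated with a minimal sketch. It assumes, purely for illustration, that each matching rule is a dictionary with an `info_type` field; the names `InfoType` and `pick_by_info_type` and the sense labels are hypothetical and do not come from the described implementation.

```python
from enum import IntEnum


class InfoType(IntEnum):
    """Dictionary information types; a lower value means a higher priority.

    The order follows the list in the text above.
    """
    COLLOCATE = 0
    IDIOM = 1
    COMPOUND = 2
    STRUCTURE_EXAMPLE = 3
    PHRASAL_VERB_EXAMPLE = 4
    USAGE = 5
    GENERAL_EXAMPLE = 6


def pick_by_info_type(matching_rules):
    """Return the matching rule derived from the highest-priority information type."""
    return min(matching_rules, key=lambda rule: rule["info_type"])


# Hypothetical rules that both match the same context.
rules = [
    {"sense": "sense_2", "info_type": InfoType.GENERAL_EXAMPLE},
    {"sense": "sense_1", "info_type": InfoType.COLLOCATE},
]
chosen = pick_by_info_type(rules)
```

Here the rule derived from a collocate is preferred over the rule derived from a general example, whatever their order in the list.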
Each rule can include a context descriptor specifying contexts in which the rule is applicable. Each context descriptor can include two or more word descriptors and a relation descriptor specifying a type of relation in which words that satisfy the word descriptors can occur.
The method can use the input text to obtain information about a set of relations between the semantically ambiguous word and other words in the input text. For each relation, the method can also obtain information about words that occur in the relation in the input text. The method can then compare a rule's context descriptor with the information about relations and words in the input text. If context descriptors of first and second rules derived from first and second types of corpus information, respectively, are both satisfied by a relation between the semantically ambiguous word and other words in the input text, the method can compare the types of corpus information from which the first and second rules are derived. For example, the method can determine that the first type of corpus information has higher priority than the second type. Based on this determination, the method can then select to disambiguate the semantically ambiguous word using the first rule rather than the second rule.
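One way to picture the comparison described above is the following sketch. It assumes that a parser has already produced relations as (type, words) pairs, that word descriptors are single normalized word forms, and that information types are encoded as integers with lower values meaning higher priority. The class names, relation labels, and sense labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    kind: str      # relation type reported by the parser, e.g. "DOBJ" (hypothetical label)
    words: tuple   # normalized forms of the words occurring in the relation


@dataclass(frozen=True)
class Rule:
    kind: str                # relation descriptor
    word_descriptors: tuple  # one normalized word form per position
    sense: str               # sense assigned to the ambiguous word
    info_priority: int       # lower value = higher-priority information type


def satisfied_by(rule, relation):
    """A context descriptor is satisfied when relation type and words both match."""
    return rule.kind == relation.kind and rule.word_descriptors == relation.words


def select_sense(relations, rules):
    """Collect every rule satisfied by some relation, then prefer by information type."""
    candidates = [rule for rel in relations for rule in rules if satisfied_by(rule, rel)]
    if not candidates:
        return None
    return min(candidates, key=lambda rule: rule.info_priority).sense


relations = [Relation("DOBJ", ("draw", "curtain")), Relation("SUBJ", ("she", "draw"))]
rules = [
    Rule("DOBJ", ("draw", "curtain"), "draw/pull", 0),    # e.g. from a collocate
    Rule("DOBJ", ("draw", "curtain"), "draw/sketch", 6),  # e.g. from a general example
    Rule("DOBJ", ("draw", "salary"), "draw/receive", 0),  # does not match this text
]
result = select_sense(relations, rules)
```

Two rules are satisfied by the same relation, and the one derived from the higher-priority information type supplies the sense.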
On the other hand, as between the first rule and a third rule derived from the first type of corpus information, the method can select the first rule on the basis that its context descriptors are more specific than the context descriptors of the third rule. For example, each context descriptor can include two or more word descriptors: Each word descriptor in word-based rules can specify one normalized word form, while each class-based rule can include at least one word descriptor that specifies an ambiguity class of word categories. The first rule can be a word-based rule while the third rule is a class-based rule.
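The preference for a word-based rule over a class-based rule can be sketched as follows. The sketch assumes that a word descriptor is either a string (one normalized word form) or a `frozenset` of category labels (an ambiguity class), and that a word satisfies a class descriptor when its set of candidate categories in a lexicon equals that class. The lexicon, category labels, and function names are hypothetical.

```python
def satisfies(descriptor, word, lexicon):
    """Check one word against one word descriptor."""
    if isinstance(descriptor, str):
        return descriptor == word             # word-based: exact normalized form
    return lexicon.get(word) == descriptor    # class-based: same ambiguity class


def is_word_based(rule):
    """A rule is word-based when every descriptor names a single word."""
    return all(isinstance(d, str) for d in rule["descriptors"])


def select_most_specific(rules, words, lexicon):
    """Among matching rules, prefer word-based rules over class-based ones."""
    matching = [
        r for r in rules
        if len(r["descriptors"]) == len(words)
        and all(satisfies(d, w, lexicon) for d, w in zip(r["descriptors"], words))
    ]
    word_based = [r for r in matching if is_word_based(r)]
    pool = word_based or matching
    return pool[0] if pool else None


# Hypothetical lexicon mapping words to their ambiguity classes.
barrier_class = frozenset({"artifact", "barrier"})
lexicon = {"curtain": barrier_class, "blind": barrier_class}

rules = [
    {"descriptors": ("draw", barrier_class), "sense": "draw/pull (class-based)"},
    {"descriptors": ("draw", "curtain"), "sense": "draw/pull (word-based)"},
]
exact = select_most_specific(rules, ("draw", "curtain"), lexicon)
similar = select_most_specific(rules, ("draw", "blind"), lexicon)
```

For "draw" with "curtain" both rules match and the word-based rule is chosen; for "draw" with "blind" only the class-based rule matches, which is how class-based rules address contexts that are similar but not identical to the dictionary example.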
The techniques can also be implemented in a machine that includes an input text, a set of rules, and a processor. The set of rules can include rules derived from two or more types of information in a corpus, with each of the rules derived from at least two of the types of information being applicable to words occurring in specified contexts. In semantically disambiguating words, the processor can obtain information about a context in which a semantically ambiguous word occurs and can select to apply a first rule to the context rather than a second rule based on the types of corpus information from which the rules are derived, as described above in relation to the method.
The techniques can also be implemented in a stored rule set that includes a storage medium and rules data stored by the storage medium. The rules data can be accessible by a processor performing semantic disambiguation. The processor can access the rules data to obtain a set of semantic disambiguation rules derived from information in a corpus that includes two or more types of information, such as a dictionary with the types of information listed above. The processor can also access the rules data to obtain, for each of a set of the rules, a type of corpus information from which the rule was derived.
The rules data can include an item of data for each rule in the set. The processor can access each rule's item of data to obtain the type of corpus information from which the rule was derived. The processor can also access each rule's item of data to obtain a type of relationship, a set of descriptors of words to which the relationship can apply, and a disambiguated sense of one of the words to which the relationship can apply. For at least one of the rules in the set, the set of descriptors can include a word and an ambiguity class applicable to another word.
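As a rough illustration of such an item of data, the sketch below stores one rule as JSON and reads it back. The field names, the JSON encoding, and the example values are assumptions made for illustration; the text does not specify a storage format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RuleItem:
    info_type: str     # type of corpus information the rule was derived from
    relation: str      # type of relationship
    descriptors: list  # each entry is a word (str) or an ambiguity class (list of str)
    target_index: int  # position of the word that the rule disambiguates
    sense: str         # disambiguated sense of that word


# Hypothetical rule with one word descriptor and one ambiguity class descriptor.
item = RuleItem(
    info_type="collocate",
    relation="DOBJ",
    descriptors=["draw", ["artifact", "barrier"]],
    target_index=0,
    sense="draw/pull",
)

stored = json.dumps(asdict(item))        # what would be written to the storage medium
loaded = RuleItem(**json.loads(stored))  # what a processor would read back
```

The loaded item exposes the information type alongside the relationship, descriptors, and sense, so a rule applier can consult the type when several rules match.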
In comparison with the SDL technique of Segond et al., the techniques provided by the invention are advantageous because they can avoid the precise match problem by providing class-based rules that can match any word in an ambiguity class rather than just one of the words.
In comparison with Ginger II of Dini et al., the techniques provided by the invention are advantageous because they alleviate the incorrect rule problem by providing a more flexible approach to selecting one of a number of matching rules. Therefore, the techniques offer the possibility of extracting rules from more of the information contained in dictionary entries, for example, and of selecting from a larger set of matching rules. In addition, where rule selection based on type of corpus information is inapplicable, a simple distance-based selection can be made, thus combining in one strategy both type-based and distance-based rule selection.
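The combined strategy can be sketched as follows. The sketch assumes that type-based selection is inapplicable when two or more matching rules share the highest-priority information type, and it uses the size of the symmetric difference between tag sets as a stand-in distance; the actual tagset distance metric is not specified here, and all names and values are hypothetical.

```python
def tagset_distance(rule_tags, context_tags):
    """Stand-in metric: number of tags present in one set but not the other."""
    return len(rule_tags ^ context_tags)


def select_rule(matching, context_tags):
    """Select by information type first; break ties by tagset distance."""
    if not matching:
        return None
    best = min(rule["info_priority"] for rule in matching)
    tied = [rule for rule in matching if rule["info_priority"] == best]
    if len(tied) == 1:
        return tied[0]  # type-based selection was decisive
    return min(tied, key=lambda rule: tagset_distance(rule["tags"], context_tags))


# Hypothetical matching rules; lower info_priority means higher priority.
matching = [
    {"sense": "a", "info_priority": 1, "tags": frozenset({"artifact"})},
    {"sense": "b", "info_priority": 1, "tags": frozenset({"artifact", "barrier"})},
    {"sense": "c", "info_priority": 4, "tags": frozenset({"artifact", "barrier"})},
]
context_tags = frozenset({"artifact", "barrier"})
chosen = select_rule(matching, context_tags)
```

Rules "a" and "b" tie on information type, so the distance decides between them; rule "c" is never considered because its information type has lower priority.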
The techniques also appear to be general, applicable to any language. The implementations that employ dictionaries appear to be applicable to any language for which a dictionary is available in electronic form. The techniques are especially advantageous, however, where a detailed dictionary is available, with many different types of information. The implementations that employ dictionaries also avoid the data acquisition bottleneck that has been observed for semantic disambiguation techniques. The dictionary implementations also do not require iterative validation, as would be necessary for machine-learning techniques, because conventional dictionaries provide typical usages of each sense.
Similarly, the techniques are generally useful in disambiguating by obtaining a word's meaning. Translation is one application of disambiguation, but the techniques should be useful in many other applications.
The following description, the drawings, and the claims further set forth these and other aspects, objects, features, and advantages of the invention.