1. Field of the Invention
The present invention relates to a method of and a system for disambiguating syntactic word multiples, such as pairs of words, which may be collocates (i.e. words frequently used together). Such a method and system may be used in natural language processing (NLP), for instance for assisting in machine translation between languages.
2. Description of the Related Art
The term "disambiguating" when applied to a word (or group of words) means clarifying the meaning of the word (or group) with reference to the context in which the word (or group) occurs. For example, the verb "fire" can be used to describe a shooting act, such as "fire pistol", or the act of dismissal from employment, such as "fire employee". Disambiguating the verb "fire" in "fire pistol" would comprise clarifying that the verb is used in the "shooting" sense.
It is known in NLP systems to provide syntactic analysis whereby a "parser" analyses input or stored text into the different "parts of speech". However, in natural languages, the same word with the same spelling and part of speech can have different meanings according to the context in which it occurs. For example, as described above, the verb "fire" describes a shooting act in the context of "fire pistol" and an act of dismissal from employment in the context of "fire employee". In such cases, syntactic analysis as available through conventional parsers is unable to clarify the meaning of words in context. There is therefore a need for "word disambiguation" in order to complement syntactic analysis in NLP systems.
A first step in word disambiguation may be performed by clustering word senses in terms of semantic similarity. Word senses may, for instance, be found in electronic dictionaries and thesauri. Semantic similarity can be assessed from electronic thesauri where synonymic links are disambiguated, i.e. each synonymic link relates specific word senses.
Words are said to be "semantically similar" or "semantically congruent" if their meanings are sufficiently close. Closeness in meaning may be established in terms of equivalence or compatibility of usage. For example, the words "gun" and "pistol" are very close in meaning because they can be used to describe the same object. Similarly, the words "ale" and "beer" are very close in meaning because the first is a specific instance of the second, i.e. "ale" is a type of "beer". The notion of semantic similarity can also be used in a relative sense to express the degree to which words are close in meaning. In this case, there may be no equivalence or compatibility in meaning. For example, there is a clear sense in which "doctor" and "nurse" are closer in meaning than "doctor" and "computer". Although "doctor" and "nurse" are neither equivalent nor compatible in meaning because they describe distinct professions, they both refer to a person who is trained to attend sick people. The words "doctor" and "computer" share little else beyond the fact that both refer to concrete entities. In this case, the specificity of the shared concept (for instance "person who is trained to attend sick people" versus "concrete entity") determines the relative semantic similarity between words.
There are several known techniques for finding semantic similarity between two words using machine readable thesauri. Examples of such techniques are disclosed in European Patent Application No. 91480001.6 and in papers published by Resnik in 1995 entitled "Using Information Content to Evaluate Semantic Similarity in a Taxonomy" (IJCAI-95) and "Disambiguating Noun Groupings with Respect to WordNet Senses" (third workshop on very large corpora, Association for Computational Linguistics). The techniques disclosed by Resnik make use of the WordNet Lexical Database disclosed by Beckwith et al, 1991, "WordNet: A Lexical Database Organised on Psycholinguistic Principles" in Lexical Acquisition, LEA, Hillsdale, N.J.
Resnik defines the semantic similarity between two words as the "entropy" value of the most informative concept subsuming or implied by the two words. This assessment is performed with reference to a lexical database such as WordNet mentioned hereinbefore, where word senses are hierarchically arranged in terms of subsumption links. For example, all senses of the nouns "clerk" and "salesperson" in WordNet are connected to the first sense of the nouns "employee", "worker", "person" so as to indicate that "clerk" and "salesperson" are a kind of "employee" which is a kind of "worker" which in turn is a kind of "person". In this case, the semantic similarity between the words "clerk" and "salesperson" would correspond to the entropy value of "employee", which is the most informative (i.e. most specific) concept shared by the two words.
The informative content (or entropy) of a concept c (for instance a set of synonymic words such as fire, dismiss, terminate, sack) is formally defined as: $-\log p(c)$
where p is the probability of c. The probability of c is obtained for each choice of text sample or collection K by dividing the frequency of c in K by the total number W of words which occur in K and which have the same part of speech as the word senses in c. This may be expressed as: $p(c_{pos}) = \frac{\mathrm{freq}(c_{pos})}{W_{pos}}$
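where pos specifies the same part of speech. The frequency of a concept is calculated by counting the occurrences of all words which are an instance of (i.e. subsumed by) the concept: every time that a word w is encountered in K, the count of all concepts subsuming w is increased by one. This may be expressed as: $\mathrm{freq}(c_{pos}) = \sum_{w \in \mathrm{words}(c_{pos})} \mathrm{count}(w)$, where words(c_pos) denotes the set of words subsumed by the concept and count(w) is the number of occurrences of w in K.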
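The counting scheme described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the taxonomy `PARENT`, the word sample and the function names are hypothetical, each word is treated as having a single sense, every word in the sample is assumed to be a noun (so that W is simply the sample size), and the natural logarithm is used because the text does not specify a base.

```python
import math
from collections import Counter

# Hypothetical toy taxonomy: each concept maps to its direct parent (None = root).
PARENT = {
    "person": None,
    "worker": "person",
    "employee": "worker",
    "clerk": "employee",
    "salesperson": "employee",
}

def subsumers(concept):
    """Return the concept itself followed by all of its ancestors."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def concept_frequencies(words):
    """Each occurrence of a word adds one to every concept subsuming it."""
    freq = Counter()
    for w in words:
        if w in PARENT:
            for c in subsumers(w):
                freq[c] += 1
    return freq

def information_content(words):
    """Return -log p(c) for each concept, with p(c) = freq(c) / W."""
    total = len(words)  # W: assumes every word in the sample is a noun
    freq = concept_frequencies(words)
    return {c: -math.log(n / total) for c, n in freq.items()}

# Hypothetical text sample K, already reduced to its nouns.
sample = ["clerk", "salesperson", "clerk", "worker",
          "person", "person", "employee", "person"]
freq = concept_frequencies(sample)
ic = information_content(sample)
```

In this toy sample the more specific a concept is, the lower its probability and the higher its informative content, which is the property the similarity measure relies on.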
The semantic similarity between two words W1 and W2 is expressed as the entropy value of the most informative concept c which subsumes both W1, W2:
$$\mathrm{sim}(W1, W2) = \max_{c \in \{x \mid \mathrm{subsumes}(x, W1) \land \mathrm{subsumes}(x, W2)\}} [-\log p(c)]$$
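As a rough illustration of this measure, the following Python sketch takes the maximum informative content over the concepts that subsume both words. The taxonomy `PARENT` and the probabilities `P` are invented for the example and are not taken from WordNet or from any corpus; a concept is assumed to subsume itself, and the natural logarithm is assumed.

```python
import math

# Hypothetical taxonomy: each concept maps to its direct parent (None = root).
PARENT = {
    "entity": None,
    "person": "entity",
    "health_professional": "person",
    "doctor": "health_professional",
    "nurse": "health_professional",
    "machine": "entity",
    "computer": "machine",
}

# Hypothetical concept probabilities (illustrative values only).
P = {
    "entity": 1.0,
    "person": 0.4,
    "health_professional": 0.05,
    "doctor": 0.02,
    "nurse": 0.02,
    "machine": 0.1,
    "computer": 0.05,
}

def subsumers(concept):
    """Return the set containing the concept and all of its ancestors."""
    found = set()
    while concept is not None:
        found.add(concept)
        concept = PARENT[concept]
    return found

def sim(w1, w2):
    """Informative content of the most informative concept subsuming both words."""
    common = subsumers(w1) & subsumers(w2)
    return max(-math.log(P[c]) for c in common)

doctor_nurse = sim("doctor", "nurse")
doctor_computer = sim("doctor", "computer")
```

With these assumed values, "doctor" and "nurse" share the specific concept `health_professional`, whereas "doctor" and "computer" share only the root `entity`, so the first pair scores higher.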
The specific senses of W1, W2 under which semantic similarity holds can be determined with respect to the subsumption relation linking c with W1, W2. For instance, if it is found that, in calculating the semantic similarity of the two verbs "fire" and "dismiss" using the WordNet lexical database, the most informative subsuming concept is represented by the synonym set containing the word sense remove_v_2, then it is known that the senses for "fire" and "dismiss" under which the similarity holds are fire_v_4 and dismiss_v_4 because these belong to the only set of word senses which are subsumed by remove_v_2 in the WordNet hierarchy.
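The sense selection just described might be sketched as follows. The sense labels remove_v_2, fire_v_4 and dismiss_v_4 follow the example in the text; every other sense, every link in `PARENT` and every probability in `P` is hypothetical and does not reproduce the actual WordNet hierarchy. The function name `disambiguate_pair` is likewise only illustrative.

```python
import math

# Hypothetical sense inventory: each word maps to its candidate senses.
SENSES = {
    "fire": ["fire_v_1", "fire_v_4"],
    "dismiss": ["dismiss_v_1", "dismiss_v_4"],
}

# Hypothetical subsumption links: sense or concept -> direct parent.
PARENT = {
    "fire_v_1": "shoot_v_1",
    "fire_v_4": "remove_v_2",
    "dismiss_v_1": "reject_v_1",
    "dismiss_v_4": "remove_v_2",
    "shoot_v_1": "act_v_1",
    "remove_v_2": "act_v_1",
    "reject_v_1": "act_v_1",
    "act_v_1": None,
}

# Hypothetical concept probabilities (illustrative values only).
P = {"act_v_1": 1.0, "remove_v_2": 0.05, "shoot_v_1": 0.02, "reject_v_1": 0.03}

def ancestors(node):
    """Return the node itself followed by all of its ancestors."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def disambiguate_pair(w1, w2):
    """Find the most informative shared concept and the senses it subsumes."""
    common = set()
    for s1 in SENSES[w1]:
        for s2 in SENSES[w2]:
            common |= set(ancestors(s1)) & set(ancestors(s2))
    best = max(common, key=lambda c: -math.log(P[c]))
    senses1 = [s for s in SENSES[w1] if best in ancestors(s)]
    senses2 = [s for s in SENSES[w2] if best in ancestors(s)]
    return best, senses1, senses2

result = disambiguate_pair("fire", "dismiss")
```

Under these assumptions the shared concept is `remove_v_2`, and only `fire_v_4` and `dismiss_v_4` lie beneath it, so those are the senses retained for the pair.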