Many words have multiple senses. Such words are often referred to as "polysemous words." For example, the word "bass" has two main senses, namely, a type of fish and a musical range. Word sense disambiguation techniques assign a sense label to each instance of an ambiguous word. Information retrieval systems, for example, are plagued by the ambiguity of language. Searches on the word "crane" will retrieve documents about birds, as well as documents about construction equipment. The user, however, is typically interested in only one sense of the word. Generally, the user must review the various documents returned by the information retrieval system to determine which returned documents are likely to be of interest. Of course, the user could narrow the search using Boolean expressions built from fuller phrases, such as a search in the form: "crane NEAR (whooping OR bird OR lakes . . . )." The user, however, risks missing relevant documents that do not happen to match the required elements in the Boolean expression.
Word sense disambiguation is important in other applications as well, such as in a text-to-speech converter where a different sense may involve a pronunciation difference. Generally, word sense disambiguation techniques presume that a set of polysemous terms is known beforehand. Of course, published lists of polysemous terms invariably provide only partial coverage. For example, the English word "tan" has several obvious senses. A published list of polysemous terms, however, may not include the abbreviation for "tangent." One such published list of polysemous terms is WordNet, an on-line lexical reference system developed by the Cognitive Science Laboratory at Princeton University in Princeton, N.J. See, for example, http://www.cogsci.princeton.edu/~wn/.
Hinrich Schütze has proposed a word sense disambiguation technique that identifies multiple senses of a target word by computing similarities among words that co-occur in a given corpus with a target word. Generally, the Schütze word sense disambiguation technique uses a corpus to compute vectors of word counts for each extant sense of a known ambiguous word. The result is a set of vectors, one for each sense, that can be used to classify a new instance. For a more detailed discussion of the Schütze word sense disambiguation technique, see H. Schütze, "Automatic Word Sense Discrimination," Computational Linguistics, V. 24, No. 1, 97-123 (1998).
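The idea of one word-count vector per sense, used to classify a new instance, can be illustrated with a minimal sketch. It is not the published algorithm: it assumes the training contexts are already grouped by sense (how that grouping is obtained is outside the sketch), it uses cosine similarity as an assumed similarity measure, and the function names and toy "crane" data are hypothetical.

```python
import math
from collections import Counter


def sense_vectors(contexts_by_sense):
    """Sum the word counts of all contexts assigned to each sense.

    Assumption: contexts (lists of tokens) are already grouped by sense.
    """
    vectors = {}
    for sense, contexts in contexts_by_sense.items():
        total = Counter()
        for context in contexts:
            total.update(context)
        vectors[sense] = total
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(context, vectors):
    """Return the sense whose vector is most similar to the new context."""
    query = Counter(context)
    return max(vectors, key=lambda sense: cosine(query, vectors[sense]))


# Hypothetical toy contexts for the ambiguous word "crane".
TOY_CONTEXTS = {
    "bird": [
        ["whooping", "flew", "over", "the", "lake"],
        ["bird", "nest", "near", "the", "lake"],
    ],
    "machine": [
        ["lifted", "steel", "at", "the", "construction", "site"],
        ["operator", "moved", "the", "steel", "beam"],
    ],
}
VECTORS = sense_vectors(TOY_CONTEXTS)
```

With these toy vectors, a call such as classify(["a", "bird", "by", "the", "lake"], VECTORS) selects the sense whose count vector shares the most weight with the new context. Note that the sketch only chooses among senses that are already known; it says nothing about how ambiguous the word is.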
While the Schütze technique provides an effective tool for identifying ambiguous words, it does not quantify how ambiguous a given word is in a given corpus. The utility of such word sense disambiguation techniques could be further extended if they could rank words by their degree of ambiguity. Thus, a need exists for a word ambiguity detection tool that identifies ambiguous words and quantifies their degree of ambiguity.