A key part of adapting natural language processing (NLP) applications to specific domains is the adaptation of their lexical and terminological resources. However, parts of a general-purpose terminological resource may consistently be unrelated to and unused within a specific domain, thereby creating a persistent and unnecessary amount of ambiguity that affects both the accuracy and efficiency of the NLP application.
The present invention presents a method for processing synonyms that adapts a general-purpose synonym resource to a specific domain. The method selects a domain-specific subset of synonyms from the set of general-purpose synonyms. The synonym processing method in turn comprises two methods that can be used either together or independently. A method of synonym pruning eliminates those synonyms that are inappropriate in a specific domain. A method of synonym optimization eliminates those synonyms that are unlikely to be used in a specific domain.
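The way the two methods compose can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names `prune` and `optimize`, the sense identifiers, the hand-supplied set of inappropriate senses, and the use of a raw corpus frequency threshold as the criterion for "unlikely to be used" are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical general-purpose resource: sense identifier -> synonym terms.
GENERAL_SYNSETS = {
    "precipitation": {"snow", "snowfall"},
    "narcotic": {"cocaine", "cocain", "coke", "snow", "C"},
}


def prune(synsets, inappropriate_ids):
    """Synonym pruning (assumed form): drop whole senses judged inappropriate."""
    return {sid: set(terms) for sid, terms in synsets.items()
            if sid not in inappropriate_ids}


def optimize(synsets, corpus_counts, min_count=1):
    """Synonym optimization (assumed form): drop terms rarely seen in the domain."""
    return {sid: {t for t in terms if corpus_counts[t] >= min_count}
            for sid, terms in synsets.items()}


# Toy weather-report corpus; a Counter returns 0 for unseen terms.
corpus_counts = Counter("snow showers expected then more snow tonight".split())

# The two steps can be applied together, as here, or each on its own.
adapted = optimize(prune(GENERAL_SYNSETS, {"narcotic"}), corpus_counts)
print(adapted)
```

Because each function takes and returns the same mapping, either step can be applied alone or the two can be chained in sequence.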
A method for adapting a general-purpose synonym resource to a specific domain has many applications. Two such applications are information retrieval (IR) and domain-specific thesauri as a writer's aid.
Synonyms can be an important resource for IR applications, and attempts have been made at using them to expand query terms. See Voorhees, E. M., “Using WordNet for Text Retrieval,” In C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database. MIT Press Books, Cambridge, Mass., chapter 12, pp. 285-303 (1998). In expanding query terms, overgeneration is as much of a problem as incompleteness or lack of synonym resources. Precision can dramatically drop because of false hits due to incorrect synonymy relations, that is, incorrect pairings of terms as synonyms. This problem is particularly felt when IR is applied to documents in specific technical domains. In such cases, the synonymy relations that hold in the specific domain are only a restricted portion of the synonymy relations holding for a given language at large. For instance, a set of synonyms like {cocaine, cocain, coke, snow, C}, valid for English in general, would be detrimental in a specific domain like weather reports, where the terms snow and C (for Celsius) both occur very frequently, but never as synonyms of each other.
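The overgeneration problem can be made concrete with a small Python sketch of query expansion. The data structure, the sense labels, and the function `expand_query` are hypothetical, and the domain-adapted resource is built here by hand simply to show the effect; how the inappropriate set is actually identified is the subject of the methods described in the present invention.

```python
# Hypothetical general-purpose resource: sense label -> synonym set.
GENERAL_SYNSETS = {
    "narcotic": frozenset({"cocaine", "cocain", "coke", "snow", "C"}),
    "precipitation": frozenset({"snow", "snowfall"}),
}


def expand_query(term, synsets):
    """Return the query term plus every synonym from any set containing it."""
    expanded = {term}
    for members in synsets.values():
        if term in members:
            expanded |= members
    return expanded


# Assumed outcome of adapting the resource to weather reports:
# the narcotic sense is absent from the domain-specific subset.
weather_synsets = {label: members for label, members in GENERAL_SYNSETS.items()
                   if label != "narcotic"}

general_expansion = expand_query("snow", GENERAL_SYNSETS)
domain_expansion = expand_query("snow", weather_synsets)

print(sorted(general_expansion))
print(sorted(domain_expansion))
```

With the general-purpose resource, a query for snow also matches every weather report that mentions C, coke, or cocaine; with the domain-specific subset, the expansion is limited to snow and snowfall.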
A second application is domain-specific thesauri as a writer's aid. When given a target word, thesauri in word processors generally list sets of synonyms organized by part of speech, and then by sense, e.g., for snow, a thesaurus might present a listing as follows:

    noun (1) precipitation falling from clouds in the form of ice crystals: snowfall
    noun (2) a narcotic (alkaloid) extracted from coca leaves: cocaine, cocain, coke, C
    verb (1) . . .

A thesaurus tailored to a specific domain would select, or at least order, the likely part of speech of a target word, the likely sense of that word for that part of speech, and favored synonym terms for that sense. The methods described in the present invention can help provide such functionality.
In both applications and others in NLP, the methods described in the present invention provide a way to automatically or semi-automatically adapt sets of synonyms to specific domains, without requiring labor-intensive manual adaptation.
The method of synonym pruning in the present invention has an obvious relationship to word sense disambiguation (Sanderson, M., Word Sense Disambiguation and Information Retrieval, Ph.D. thesis, Technical Report (TR-1997-7), Department of Computing Science at the University of Glasgow, Glasgow G12 (1997); Leacock, C., Chodorow, M., and G. A. Miller, “Using Corpus Statistics and WordNet Relations for Sense Identification,” Computational Linguistics, 24, (1), pp. 147-165 (1998)), since both are based on identifying senses of ambiguous words in a text. However, the two tasks are quite distinct. In word sense disambiguation, a set of candidate senses for a given word is checked against each occurrence of the relevant word in a text, and a single candidate sense is selected for each occurrence of the word. In synonym pruning, a set of candidate senses for a given word is checked against an entire corpus, and a subset of candidate senses is selected. Although the latter task could be reduced to the former (by disambiguating all occurrences of a word in a text and taking the union of the selected senses), alternative approaches could also be used. In a specific domain, where words can be expected to be monosemous (i.e., having only a single sense) to a large extent, synonym pruning can be an effective alternative (or a complement) to word sense disambiguation.
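The reduction mentioned above, in which synonym pruning is obtained by disambiguating every occurrence of a word and taking the union of the selected senses, can be sketched in Python as follows. The disambiguator is a deliberately crude stand-in based on cue-word overlap, and the sense labels, cue words, and corpus are invented for illustration; any real word sense disambiguation procedure could take its place.

```python
# Hypothetical cue words for each candidate sense of "snow".
SENSE_CUES = {
    "snow/precipitation": {"cm", "forecast", "overnight", "temperatures"},
    "snow/narcotic": {"dealer", "gram", "arrest"},
}


def disambiguate(context_tokens, candidate_senses):
    """Toy per-occurrence WSD: choose the sense whose cues overlap the context most."""
    context = set(context_tokens)
    return max(candidate_senses, key=lambda sense: len(SENSE_CUES[sense] & context))


def prune_senses(word, corpus, candidate_senses):
    """Corpus-level pruning as the union of per-occurrence WSD decisions."""
    retained = set()
    for sentence in corpus:
        tokens = sentence.lower().split()
        if word in tokens:
            retained.add(disambiguate(tokens, candidate_senses))
    return retained


# Toy weather-report corpus (assumed data).
corpus = [
    "Forecast : 10 cm of snow overnight",
    "Snow expected as temperatures drop below 0 C",
]

retained = prune_senses("snow", corpus, list(SENSE_CUES))
print(retained)
```

The per-occurrence step returns exactly one sense, while the corpus-level step returns a subset of the candidate senses. In a largely monosemous domain that subset will often contain a single sense, which is why approaches that avoid per-occurrence disambiguation altogether may suffice.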
From a different perspective, synonym pruning is also related to the task of assigning Subject Field Codes (SFC) to a terminological resource, as done by Magnini and Cavaglià (2000) for WordNet. See Magnini, B., and G. Cavaglià, “Integrating Subject Field Codes into WordNet,” In M. Gavrilidou, G. Carayannis, S. Markantonatou, S. Piperidis, and G. Stainhaouer (Eds.) Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000), Athens, Greece, pp. 1413-1418 (2000). In WordNet a set of synonyms is known as a “synset”. Assuming that a specific domain corresponds to a single SFC (or a restricted set of SFCs, at most), the difference between SFC assignment and synonym pruning is that the former assigns one of many possible values to a given synset (one of all possible SFCs), while the latter assigns one of two possible values (the synset belongs or does not belong to the SFC representing the domain). In other words, SFC assignment is a classification task, while synonym pruning can be seen as a ranking/filtering task.
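The contrast between the two tasks can be shown by their output types in a short Python sketch. The SFC labels, the per-synset scores, and the threshold below are invented placeholders rather than values from WordNet or from Magnini and Cavaglià; the point is only that one function returns one of many labels while the other returns a yes/no decision for a fixed domain.

```python
# Hypothetical relevance scores of each synset for each Subject Field Code.
SFC_SCORES = {
    "snow, snowfall": {"meteorology": 0.90, "medicine": 0.00, "law": 0.10},
    "cocaine, cocain, coke, snow, C": {"meteorology": 0.05, "medicine": 0.60, "law": 0.35},
}


def assign_sfc(synset):
    """Classification: choose one value out of all possible SFCs."""
    scores = SFC_SCORES[synset]
    return max(scores, key=scores.get)


def keep_in_domain(synset, domain_sfc, threshold=0.5):
    """Filtering: a two-valued decision for the single SFC representing the domain."""
    return SFC_SCORES[synset][domain_sfc] >= threshold


for synset in SFC_SCORES:
    print(synset, "->", assign_sfc(synset), "|", keep_in_domain(synset, "meteorology"))
```

Sorting synsets by their score for the domain SFC, instead of thresholding, would give the ranking variant of the same task.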