The existence of field-dependent translations of terms has long been a problem for both human translation and machine translation. A term in a source language, for example Japanese, may have more than one translation in a target language, for example English, depending on the subject, topic or field of the document being translated. For example, the Japanese word “soshiki” would be translated into English as “tissue” in a medical document, as “weave” in a textiles document, or as “microstructure” in a metallurgy document.
Conventional machine translation programs, for example, Systran®, contain topical dictionaries or glossaries. The user must manually select topical dictionaries appropriate for the document being translated. In this case, there is one dictionary per topic, for example, chemistry or medicine, rather than one topic per dictionary entry or record as in the current invention.
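The difference between the two dictionary layouts can be illustrated with a minimal sketch. It is not taken from Systran® or from any disclosed implementation; the names `TOPICAL_DICTIONARIES`, `Entry`, `translate_with_selected_dictionary` and `translations_by_topic` are hypothetical, and the topic labels are simply the fields named in the “soshiki” example.

```python
from dataclasses import dataclass

# Layout (a): one dictionary per topic. The user must choose the topic by hand.
TOPICAL_DICTIONARIES = {
    "medicine": {"soshiki": "tissue"},
    "textiles": {"soshiki": "weave"},
    "metallurgy": {"soshiki": "microstructure"},
}


def translate_with_selected_dictionary(term, selected_topic):
    """Look a term up in the single topical dictionary the user selected."""
    return TOPICAL_DICTIONARIES[selected_topic].get(term)


# Layout (b): one topic per dictionary entry (record). All senses sit in one
# dictionary, and each record carries its own topic label.
@dataclass(frozen=True)
class Entry:
    source: str  # source-language term
    target: str  # target-language translation
    topic: str   # topic label attached to this record


ENTRIES = [
    Entry("soshiki", "tissue", "medicine"),
    Entry("soshiki", "weave", "textiles"),
    Entry("soshiki", "microstructure", "metallurgy"),
]


def translations_by_topic(term):
    """Return every translation of a term, keyed by the topic of its record."""
    return {e.topic: e.target for e in ENTRIES if e.source == term}


print(translate_with_selected_dictionary("soshiki", "medicine"))
print(translations_by_topic("soshiki"))
```

In layout (a) the topic is an input that someone must supply before lookup. In layout (b) the topic is data stored with each record, so a program can compare the candidate records for a term without a prior manual selection.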
The machine translation program METAL contains three individual lexicons: a German monolingual lexicon, an English monolingual lexicon and a German-English bilingual lexicon (Katherine Koch, “Machine Translation and Terminology Database—Uneasy Bedfellows?” Lecture Notes in Artificial Intelligence 898, Machine Translation and the Lexicon, Petra Steffens, ed., Springer, Berlin, 1995, pp. 131-140). Semantic information is disclosed only for the monolingual lexicons, not for the bilingual lexicon. Even for the monolingual lexicons, only 15 semantic types are disclosed, such as “abstract”, “concrete”, “human”, “animal” and “process.” These are quite different from the topical classifications that are the subject of the current invention.
Brigitte Blaser, in “TransLexis: An Integrated Environment for Lexicon and Terminology Management,” Lecture Notes in Artificial Intelligence 898, Machine Translation and the Lexicon, Petra Steffens, ed., Springer, Berlin, 1995, pp. 158-173, discloses the incorporation of concepts, including broader concepts, narrower concepts and related concepts, in a lexicon database management system for machine translation. However, this disclosure does not extend to the incorporation of subject codes, notably hierarchical subject codes, in a multilingual electronic dictionary, nor does it disclose the use of concepts or other subject area information for automatic topic discrimination in machine translation. Notably, these concepts are not subject areas; rather, they constitute the interlingua for interlingua-based machine translation.
Masterson disclosed a means of automatic sense disambiguation for the machine translation of Latin to English in the article “The thesaurus in syntax and semantics,” Mechanical Translation, Vol. 4, pp. 1-2, 1957. As described by Wilks, Slator, and Guthrie in Electric Words: Dictionaries, Computers and Meanings, MIT Press, Cambridge, Mass., 1996, pp. 88-89, Masterson disclosed a nonstatistical method using the headings in Roget's Thesaurus.
In this predecessor to interlingua-based machine translation, Masterson disclosed a concept thesaurus for the words in a Latin passage from Virgil's Georgics. Each word stem from the Latin passage was associated with a set of head numbers from Roget's International Thesaurus by translating the word stems into English and selecting the head numbers for the corresponding English words. For example, the three Latin noun stems “agricola”, “terram” and “aratro” have the following heads (where the head words are shown instead of the head numbers):
    AGRICOLA: Region, Agriculture
    TERRAM: Region, Land, Furrow
    ARATRO: Agriculture, Furrow, Convolution
In the case of the text, “Agricola incurvo in terram dimovit aratro”, the heads that occur more than once are selected into a concept set. In the above example, this yields the following sets:
    AGRICOLA: Region, Agriculture
    TERRAM: Region, Furrow
    ARATRO: Agriculture, Furrow
Finally, the English words listed under each head in Roget's Thesaurus are intersected to leave the appropriate translation candidates. In the current example, this yields the following sets:
    AGRICOLA: farmer, ploughman
    TERRAM: soil, ground
    ARATRO: plough, ploughman, rustic
Masterson does not disclose a multilingual dictionary, nor does she disclose use of topical codes in a multilingual dictionary for disambiguation.
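The head-selection and intersection steps described above can be sketched in a few lines. This is only an illustrative reading of the prose, not Masterson's actual procedure or code. The head assignments in `HEADS` are copied from the example, while the function names and the contents of `TOY_THESAURUS` are hypothetical. The toy word lists are invented and heavily abridged, so the final candidates are not intended to reproduce the lists given in the text.

```python
from collections import Counter

# Heads for each Latin stem, copied from the example in the text.
HEADS = {
    "AGRICOLA": ["Region", "Agriculture"],
    "TERRAM": ["Region", "Land", "Furrow"],
    "ARATRO": ["Agriculture", "Furrow", "Convolution"],
}


def shared_heads(heads_by_word):
    """For each word, keep only the heads carried by more than one word."""
    counts = Counter(
        head for heads in heads_by_word.values() for head in set(heads)
    )
    return {
        word: [head for head in heads if counts[head] > 1]
        for word, heads in heads_by_word.items()
    }


def candidates(concepts, thesaurus):
    """Intersect the English word lists found under each retained head."""
    result = {}
    for word, heads in concepts.items():
        word_sets = [set(thesaurus.get(head, ())) for head in heads]
        result[word] = set.intersection(*word_sets) if word_sets else set()
    return result


# Hypothetical, abridged thesaurus entries (NOT taken from Roget's Thesaurus).
TOY_THESAURUS = {
    "Region": ["farmer", "soil", "ground"],
    "Agriculture": ["farmer", "plough", "rustic"],
    "Furrow": ["soil", "ground", "plough", "rustic"],
}

concepts = shared_heads(HEADS)
print(concepts)
print(candidates(concepts, TOY_THESAURUS))
```

With the heads from the example, `shared_heads` drops “Land” and “Convolution”, each of which occurs under only one stem, and so reproduces the concept sets listed above. The second step depends entirely on the thesaurus contents supplied to it.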
Kenneth W. Church, William A. Gale and David E. Yarowsky, in U.S. Pat. No. 5,541,836, also disclosed the use of the categories from Roget's Thesaurus in automatically disambiguating word/sense pairs and the use of bilingual bodies of text to train word/sense probability tables. Church et al. do not disclose a multilingual dictionary, nor do they disclose the use of topical codes in a multilingual dictionary for sense disambiguation.
JuneJei Kuo, in U.S. Pat. No. 5,285,386, “Machine Translation Apparatus Having Means for Translating Polysemous Words Using Dominated Codes”, discloses interlingua-based machine translation using semantic codes in the role of the interlingua. While Kuo discloses transfer dictionaries, these are not multilingual transfer dictionaries. Rather, they are transfer dictionaries between the semantic codes, which serve as the interlingua in this case, and words in the target language.
The requirement of manually selecting a topical dictionary is a barrier to the automated translation of documents such as patent documents that cover many topical areas. Also, the semantic methods of the interlingua-based approaches do not provide for automatically determining the topic of the document being translated. There is a need for a means for automatically determining the most appropriate target definition depending on the topic of the document. Such a means is referred to as “automatic topic disambiguation” in the text below.
Elizabeth Liddy, Woojin Palk and Edmund Szi-li Wu, in U.S. Pat. No. 5,873,056, “Natural Language Processing System for Semantic Vector Representation Which Accounts for Lexical Ambiguity”, disclose a monolingual lexical database that contains nonhierarchical subject codes assigned to each word in the database.
To avoid unnecessary reiteration of prior teachings, the disclosure of each reference cited herein is hereby incorporated by reference.