A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent disclosure, as it appears in the PTO patent file or records, but otherwise reserves all copyright rights whatsoever. Copyright © 1999 ISTA.
The present invention relates to multilingual electronic dictionaries that may be used for machine translation.
"Multilingual" means pertaining to two or more languages.
Unless the context otherwise requires, the terms "subject", "topic" and "field" are virtually synonymous in this disclosure, as are the terms "dictionary", "glossary" and "lexicon."
The existence of field-dependent translations of terms has long been a problem for both ordinary human translation and machine translation. A term in a source language, for example, Japanese, may have more than one translation in a target language, for example, English, depending on the subject, topic or field of the document being translated. For example, the word "soshiki" in Japanese would be translated to the English "tissue" in a medical document, to the English "weave" in the case of textiles, or to the English "microstructure" in the case of metallurgy.
Conventional machine translation programs, for example, Systran®, contain topical dictionaries or glossaries. The user must manually select topical dictionaries appropriate for the document being translated. In this case, there is one dictionary per topic, for example, chemistry or medicine, rather than one topic per dictionary entry or record as in the current invention.
The machine translation program METAL contains three individual lexicons: a German monolingual lexicon, an English monolingual lexicon and a German-English bilingual lexicon (Katherine Koch, "Machine Translation and Terminology Database - Uneasy Bedfellows?" Lecture Notes in Artificial Intelligence 898, Machine Translation and the Lexicon, Petra Steffens, ed., Springer, Berlin, 1995, pp. 131-140). Semantic information is disclosed only for the monolingual lexicons, not for the bilingual lexicon. Even for the monolingual lexicon, only 15 semantic types are disclosed, such as "abstract", "concrete", "human", "animal" and "process." These are quite different from the topical classifications that are the subject of the current invention.
Brigitte Blaser, in "TransLexis: An Integrated Environment for Lexicon and Terminology Management," Lecture Notes in Artificial Intelligence 898, Machine Translation and the Lexicon, Petra Steffens, ed., Springer, Berlin, 1995, pp. 158-173, discloses the incorporation of concepts, including broader concepts, narrower concepts and related concepts, in a lexicon database management system for machine translation. However, this disclosure does not extend to the incorporation of subject codes, notably hierarchical subject codes, in a multilingual electronic dictionary, nor does it disclose the use of concepts or other subject area information for automatic topic discrimination in machine translation. Notably, these concepts are not subject areas; rather, they constitute the interlingua for interlingua-based machine translation.
Masterson disclosed a means of automatic sense disambiguation for the machine translation of Latin to English in the article "The thesaurus in syntax and semantics," Mechanical Translation, Vol. 4, pp. 1-2, 1957. As described by Wilks, Slator, and Guthrie in Electric Words: Dictionaries, Computers and Meanings, MIT Press, Cambridge, Mass., 1996, pp. 88-89, Masterson disclosed a nonstatistical method using the headings in Roget's Thesaurus.
In this predecessor to interlingua-based machine translation, Masterson disclosed a concept thesaurus for the words in a Latin passage from Virgil's Georgics. Each word stem from the Latin passage was associated with a set of head numbers from Roget's International Thesaurus by translating the word stems into English and selecting the head numbers for the corresponding English words. For example, the three Latin noun stems, "agricola", "terram" and "aratro" have the following heads (where the head words are shown instead of the head numbers):
AGRICOLA: Region, Agriculture
TERRAM: Region, Land, Furrow
ARATRO: Agriculture, Furrow, Convolution
In the case of the text, "Agricola incurvo in terram dimovit aratro", the heads that occur more than once are selected into a concept set. In the above example, this yields the following sets:
AGRICOLA: Region, Agriculture
TERRAM: Region, Furrow
ARATRO: Agriculture, Furrow
Finally, the English words listed under each head in Roget's Thesaurus are intersected to leave the appropriate translation candidates. In the current example, this yields the following sets:
AGRICOLA: farmer, ploughman
TERRAM: soil, ground
ARATRO: plough, ploughman, rustic
Masterson does not disclose a multilingual dictionary, nor does she disclose use of topical codes in a multilingual dictionary for disambiguation.
Kenneth W. Church, William A. Gale and David E. Yarowsky, in U.S. Pat. No. 5,541,836, also disclosed the use of the categories from Roget's Thesaurus in automatically disambiguating word/sense pairs and the use of bilingual bodies of text to train word/sense probability tables. Church et al. do not disclose a multilingual dictionary, nor do they disclose the use of topical codes in a multilingual dictionary for sense disambiguation.
JuneJei Kuo, in U.S. Pat. No. 5,285,386, "Machine Translation Apparatus Having Means for Translating Polysemous Words Using Dominated Codes", discloses interlingua-based machine translation using semantic codes in the role of the interlingua. While Kuo discloses transfer dictionaries, these are not multilingual transfer dictionaries. Rather, they are transfer dictionaries between the semantic codes, the interlingua in this case, and words in the target language.
The requirement of manually selecting a topical dictionary is a barrier to the automated translation of documents such as patent documents that cover many topical areas. Also, the semantic methods of the interlingua-based approaches do not provide for automatically determining the topic of the document being translated. There is a need for a means for automatically determining the most appropriate target definition depending on the topic of the document. Such a means is referred to as "automatic topic disambiguation" in the text below.
Elizabeth Liddy, Woojin Palk and Edmund Szi-li Wu, in U.S. Pat. No. 5,873,056, "Natural Language Processing System for Semantic Vector Representation Which Accounts for Lexical Ambiguity", disclose a monolingual lexical database that contains nonhierarchical subject codes assigned to each word in the database. To avoid unnecessary reiteration of prior teachings, the disclosure of each reference cited herein is hereby incorporated by reference.
This invention provides a multilingual electronic dictionary comprising a memory that contains a data structure composed of a plurality of records, each record comprising representations of the following: a first term (in a first language), a second term (in a second language), and a topical code. The topical code indicates a topical area in which the second term is a translation of the first term.
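By way of illustration only, a record of this kind might be laid out in C++ roughly as follows. This is a minimal sketch, not the claimed data structure: the names `DictionaryRecord`, `makeRecord`, `lookup` and `sampleDictionary` are hypothetical, and the topical codes in the sample data are placeholders rather than codes from any real classification system.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical layout of one dictionary record.
struct DictionaryRecord {
    std::string firstTerm;    // term in the first language
    std::string secondTerm;   // its translation in the second language
    std::string topicalCode;  // topical area in which the translation holds
};

// Build a record, checking that all three representations are present.
DictionaryRecord makeRecord(const std::string& first,
                            const std::string& second,
                            const std::string& code) {
    assert(!first.empty() && !second.empty() && !code.empty());
    return DictionaryRecord{first, second, code};
}

// Return every record whose first-language term equals `term`.
std::vector<DictionaryRecord> lookup(const std::vector<DictionaryRecord>& dict,
                                     const std::string& term) {
    std::vector<DictionaryRecord> hits;
    for (const auto& rec : dict) {
        if (rec.firstTerm == term) hits.push_back(rec);
    }
    return hits;
}

// Sample data: the three translations of "soshiki" discussed in the
// background, with placeholder codes (not real classification codes).
std::vector<DictionaryRecord> sampleDictionary() {
    return {
        makeRecord("soshiki", "tissue", "MEDICINE"),
        makeRecord("soshiki", "weave", "TEXTILES"),
        makeRecord("soshiki", "microstructure", "METALLURGY"),
    };
}
```

Calling `lookup(sampleDictionary(), "soshiki")` returns all three records. Each candidate translation carries its own topical code, which is what makes topic-based selection possible.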
Such an electronic dictionary allows for selecting topic-appropriate translations of terms in a textual object in a first language into a second language. This is accomplished by:
(a) providing an electronic dictionary containing records comprising representations of terms in the first language and the second language;
(b) scanning a textual object in the first language to identify each occurrence of a term in the textual object in a record of the electronic dictionary;
(c) inserting each topical code associated with each of the records identified in step (b) into a data structure that provides for counting of the frequency of occurrence of each topical code; and
(d) whenever there occur a plurality of terms in the second language corresponding to a term in the first language, selecting the term associated with the most frequently occurring topical code.
In one embodiment of the invention, step (c) is performed by generating a table associating each topical code occurring in the textual object with its frequency of occurrence.
In another embodiment, as illustrated in the Examples below, step (c) is performed by the use of a map class.
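Steps (a) through (d) above can be sketched in C++ along the following lines. This is an illustrative sketch under stated assumptions, not a definitive implementation: the text is assumed to have been split into terms already, the names `Record`, `countCodes`, `selectTranslation` and `exampleSelection` are hypothetical, the topical codes are placeholders, the romanized terms "saibo" and "kekkan" are illustrative sample entries, and resolving ties in favor of the earliest record is an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Record {            // hypothetical record layout
    std::string source;    // term in the first language
    std::string target;    // term in the second language
    std::string code;      // topical code (placeholder values below)
};

// Steps (b) and (c): scan the terms of the text and count the topical
// code of every dictionary record whose source term occurs in it.
std::map<std::string, int> countCodes(const std::vector<Record>& dict,
                                      const std::vector<std::string>& tokens) {
    std::map<std::string, int> freq;
    for (const auto& tok : tokens)
        for (const auto& rec : dict)
            if (rec.source == tok) ++freq[rec.code];
    return freq;
}

// Step (d): among the candidate translations of `term`, pick the one whose
// topical code was counted most often. Ties keep the earliest record.
std::string selectTranslation(const std::vector<Record>& dict,
                              const std::map<std::string, int>& freq,
                              const std::string& term) {
    std::string best;
    int bestCount = -1;
    for (const auto& rec : dict) {
        if (rec.source != term) continue;
        auto it = freq.find(rec.code);
        const int n = (it == freq.end()) ? 0 : it->second;
        if (n > bestCount) { bestCount = n; best = rec.target; }
    }
    return best;  // empty if the term is not in the dictionary
}

// Usage with illustrative data: two medical terms in the text tip the
// ambiguous "soshiki" toward "tissue".
std::string exampleSelection() {
    const std::vector<Record> dict = {
        {"soshiki", "tissue", "MEDICINE"},
        {"soshiki", "weave", "TEXTILES"},
        {"saibo", "cell", "MEDICINE"},
        {"kekkan", "blood vessel", "MEDICINE"},
    };
    const std::vector<std::string> tokens = {"soshiki", "saibo", "kekkan"};
    const auto freq = countCodes(dict, tokens);
    assert(freq.at("MEDICINE") == 3 && freq.at("TEXTILES") == 1);
    return selectTranslation(dict, freq, "soshiki");
}
```

Here the ambiguous term itself contributes one count to each of its candidate codes, so the decision rests on the unambiguous terms around it.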
In preferred embodiments of the apparatus and methods of this invention, one or more of the terms are represented in Unicode.
In the electronic dictionary of the present invention, each record may optionally include a representation of the part of speech for the first term, or for each term. However, this is not a necessary field in the record.
Also, each record may optionally include a representation of the language of the first term and of the language of the second term. Alternatively, a specific representation of the language (e.g. its name or a code such as JP indicating Japanese) may be omitted where an indication of the language is inherent in the structure of the record (e.g. the first field is always Japanese and the second is always English).
The dictionary of the present invention is not limited to bilingual records, but may be generated with records that accommodate a third language, or any number of languages.
In preferred embodiments of the invention, the topical coding system is a hierarchical one, e.g. the International Patent Classification system.
Such embodiments are desirably used for selecting topic-appropriate translations of terms in a textual object in a first language into a second language by doing the following:
(a) providing an electronic dictionary containing records comprising representations of terms in the first language and the second language along with a topical code from a hierarchical system;
(b) scanning a textual object in the first language to identify each occurrence of a term in the textual object in a record of said electronic dictionary;
(c) inserting each topical code associated with each of the records identified in step (b) into a plurality of data structures that provide for counting of the frequency of occurrence of each topical code at a plurality of levels of the hierarchy; and
(d) whenever there occur a plurality of terms in the second language corresponding to a term in the first language, selecting the term associated with the most frequently occurring topical code;
wherein steps (c) and (d) are applied iteratively, first at a coarser level of the topical code hierarchy, then at successively more detailed levels of the topical code hierarchy until topical ambiguities are either completely resolved or resolved to the extent allowed by the most detailed level of the hierarchy.
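The iterative, level-by-level application of steps (c) and (d) might be sketched in C++ as follows. This is one possible reading offered for illustration, not the definitive procedure: the names `Record`, `keyAtLevel`, `disambiguate` and `exampleHierarchical` are hypothetical, the codes are written in IPC style purely as sample data, and for brevity the sketch uses only four levels (section, class, subclass, full code) obtained by truncating the code string.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Record {
    std::string source, target, code;  // code in IPC style, e.g. "A61K 47/38"
};

// Truncate a code to a hierarchy level: 0 = section ("A"), 1 = class
// ("A61"), 2 = subclass ("A61K"), 3 = full code. Simplifying assumption:
// main group and subgroup are not treated as separate levels here.
std::string keyAtLevel(const std::string& code, int level) {
    assert(level >= 0);
    static const std::size_t lengths[] = {1, 3, 4};
    return level < 3 ? code.substr(0, lengths[level]) : code;
}

// Resolve one ambiguous term. `textCodes` holds the topical codes of all
// dictionary records matched while scanning the text in step (b).
std::vector<Record> disambiguate(std::vector<Record> candidates,
                                 const std::vector<std::string>& textCodes) {
    for (int level = 0; level <= 3 && candidates.size() > 1; ++level) {
        // Step (c): count the codes truncated to the current level.
        std::map<std::string, int> freq;
        for (const auto& c : textCodes) ++freq[keyAtLevel(c, level)];

        auto countOf = [&](const Record& r) {
            auto it = freq.find(keyAtLevel(r.code, level));
            return it == freq.end() ? 0 : it->second;
        };

        // Step (d): keep only candidates under the most frequent key.
        int best = 0;
        for (const auto& r : candidates) best = std::max(best, countOf(r));
        std::vector<Record> kept;
        for (const auto& r : candidates)
            if (countOf(r) == best) kept.push_back(r);
        candidates = kept;
    }
    return candidates;  // one record if resolved, several if still tied
}

// Illustrative data: sections A and C tie, classes A61 and C22 tie, and
// the subclass level finally favors C22C.
std::vector<Record> exampleHierarchical() {
    const std::vector<Record> candidates = {
        {"soshiki", "tissue", "A61K 47/38"},
        {"soshiki", "weave", "D03D 1/00"},
        {"soshiki", "microstructure", "C22C 1/00"},
    };
    const std::vector<std::string> textCodes = {
        "A61K 9/00", "A61B 5/00", "C22C 1/00", "C22C 38/00"};
    return disambiguate(candidates, textCodes);
}
```

In this sample the "weave" candidate is eliminated at the section level, while "tissue" and "microstructure" remain tied until the subclass level, where the two C22C codes outweigh the single A61K code.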
The multilingual electronic dictionary of this invention provides for automatic topic disambiguation by including one or more topic codes in definitions contained in the dictionary.
A dictionary record according to this invention is part of a data structure contained in a machine-accessible memory. This record contains at least one topic code and comprises the following items (with an example of a record for the Japanese term "soshiki" as shown in Table 1):
Although the Japanese term is shown in the Tables herein in English characters within quotation marks, in the records of the present invention, a term in a language that uses other than English characters is preferably represented in a customary coding system such as Unicode. Alternatively all terms, optionally including the Topic Code, may for consistency be represented in such a coding system.
The electronic dictionary of the present invention is embodied as a data structure in any form of machine-accessible memory, which may be permanent or transient. For example, the data structure may be stored using means known in the art for digital or other discrete encoding that is readable to produce a physical signal responding to the contents of selected memory locations, as by electromagnetic or optical means. Various magnetic memories are well known, including fixed disc drives, removable diskettes, tape, and cards. Integrated circuit memory modules may also be used in the present invention, including those in a self-contained form such as PCMCIA cards. A dictionary of the present invention may desirably be stored in permanent form, as on CD-ROM or like media.
In accordance with the present invention, the dictionary may be accessed by a general purpose computer running an operating system (e.g. Windows, Mac OS, Unix, Linux, Pick, etc.) suitable to access the memory on which the dictionary is resident (either permanently or transiently) and including suitable application programming. For purpose of exemplification, programming in the C++ language is disclosed herein, but the reader will appreciate that any of a wide variety of programming languages or database applications may alternatively be employed, including, for example, Pascal, Fortran, COBOL, Eiffel, Java; Access, dBase, FoxPro, Paradox, and the like.
A dictionary of the present invention may be incorporated in a standalone, handheld unit, as an enhanced version of translators such as those currently available from Selectronics, Sony EB Electronic Book, and Franklin Computer Corporation. Alternatively, a dictionary and associated programming in accordance with the present invention may be stored on a single general purpose computer or distributed on a CD-ROM; or it may be made available as a service via a network of computers, e.g. an intranet, a wide-area network, or a global communications network such as the Internet.
Although the records illustrated in this disclosure include fields for Part of Speech, the reader should understand that an electronic dictionary of the present invention does not require information as to a term's Part of Speech, and so the present invention may optionally be implemented without any such fields.
A dictionary record according to this invention may contain more than two languages. For example, it may also contain the term in German and French as shown in Table 2 below.
Alternatively, a record representing a term in more than two languages may be structured to include a field for a representation of the language for which each translation is provided. For example: "soshiki"/Japanese/tissue/English/Gewebe/German/tissu/French/A61K 47/38
A dictionary record according to this invention may contain more than one topic code, as shown in the record in Table 3 below.
There are several topical code systems in existence that can be used in this invention. This can be a nonhierarchical system such as that disclosed in Appendix A of U.S. Pat. No. 5,873,056. However, hierarchical topical code systems are preferred.
There are several known hierarchical topical code systems that can be used. Examples include the International Patent Classification (IPC) codes, the United States Patent Classifications, the categories of Roget's International Thesaurus®, the Dewey Decimal System, and the Library of Congress Card Catalog Classification system. Other subject codes that may be used include the Longman subject codes disclosed in the Longman Dictionary of Contemporary English published by Longman Group UK Limited, Longman House, Burnt Mill, Harlow, Essex CM22JE, England.
For example, the IPC codes contain five levels of classification as illustrated in Table 4 below.
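As a hedged illustration of how the five levels of an IPC code might be derived from the code string itself, consider the following C++ sketch. It assumes the levels are section, class, subclass, main group and subgroup (the usual IPC terminology) and that the code is written in the form "A61K 47/38". The function name `ipcLevels` is hypothetical, and the main group is represented here simply as the text before the slash.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split an IPC-style code into five progressively more detailed keys.
std::vector<std::string> ipcLevels(const std::string& code) {
    assert(code.size() >= 4);  // at least section + class + subclass
    const std::size_t slash = code.find('/');
    return {
        code.substr(0, 1),      // section
        code.substr(0, 3),      // class
        code.substr(0, 4),      // subclass
        code.substr(0, slash),  // main group (whole code if there is no '/')
        code,                   // subgroup (full code)
    };
}
```

For the code "A61K 47/38" this yields the keys "A", "A61", "A61K", "A61K 47" and "A61K 47/38", each of which can serve as a counting key at its own level of the hierarchy.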
These levels in hierarchical topical code systems present the advantage of granularity. As is disclosed in the Examples below, if an ambiguity cannot be resolved at a shallow level, it may be resolvable at a deeper code level.
The topic codes for a given set of terms can be selected by locating a document that has previously been classified by topic and for which the source term and its translations are appropriate. For example, in translating a Japanese patent document into English, the main IPC code for that document can be used for Japanese-English term pairs that are encountered during translation.
According to this invention, these topic codes are used to determine a favored subject area within which to translate a particular term. Briefly, to determine the favored subject area, the topic codes for part or all of the terms in a particular block of text are counted. The block can be a sentence, a paragraph, a table, a set of text occurring within a certain number of bytes or words, a subdocument, an entire document or any other definable set of text in the source document.
There are several counting methods known to the art. For example, a two-column table comprising topical codes and their corresponding frequencies may be used. The preferred method is a "map" as used in the programming language C++. Descriptions of "map" may be found in Mark Nelson's "C++ Programmer's Guide to the Standard Template Library," IDG Books Worldwide, Inc, Foster City, 1995, or in Microsoft Corporation's "Microsoft® Visual Studio™ 6.0 Development System", 1998 (hereinafter, "MVS"). To quote the latter:
"The template class describes an object that controls a varying-length sequence of elements of type pair<const Key, T>. The first element of each pair is the sort key and the second is its associated value. The sequence is represented in a way that permits lookup, insertion, and removal of an arbitrary element with a number of operations proportional to the logarithm of the number of elements in the sequence (logarithmic time). Moreover, inserting an element invalidates no iterators, and removing an element invalidates only those iterators that point at the removed element."
In the current invention, the topical code or a substring generated from the topical code is assigned to the Key and inserted into the map. The second member of the pair<const Key, T> may be arbitrary as the map only needs to be used for counting the frequencies of the codes.
If a document has been assigned one or more topical codes, for example, IPC codes in the case of patent documents, these assigned codes can optionally be added to the counts of topical codes for the block of text being analyzed. It should be stressed that the use of such document-level assigned codes is optional and that the method of the current invention can be applied to any text.
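The optional use of document-level codes might look like the following C++ sketch. It is illustrative only: the names `CodeCounts`, `addAssignedCodes` and `exampleWithAssignedCodes` are hypothetical, the sample codes and counts are made up for the example, and treating each assigned code as exactly one additional occurrence is an assumption (the disclosure does not specify a weight).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using CodeCounts = std::map<std::string, int>;

// Add each document-level code to the per-block counts, one count each.
// Weighting every assigned code as a single occurrence is an assumption.
void addAssignedCodes(CodeCounts& counts,
                      const std::vector<std::string>& assignedCodes) {
    for (const auto& code : assignedCodes) ++counts[code];
}

// Illustrative data: two codes tie within the block, and the code
// assigned to the whole document breaks the tie.
CodeCounts exampleWithAssignedCodes() {
    CodeCounts counts = {{"A61K 47/38", 2}, {"C22C 1/00", 2}};
    addAssignedCodes(counts, {"A61K 47/38"});
    assert(counts["A61K 47/38"] == 3);
    return counts;
}
```

Because the assigned codes are simply merged into the same counts, the selection step is unchanged whether or not a document carries such codes.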
While there are several coding systems that may be used in this invention, the preferred coding system is the Unicode® wide character set as described in Microsoft Corporation's "Microsoft® Visual Studio™ 6.0 Development System", 1998 as follows:
"Unicode: The Wide Character Set. A wide character is a 2-byte multilingual character code. Any character in use in modern computing worldwide, including technical symbols and special publishing characters, can be represented according to the Unicode specification as a wide character. Developed and maintained by a large consortium that includes Microsoft, the Unicode standard is now widely accepted. Because every wide character is always represented in a fixed size of 16 bits, using wide characters simplifies programming with international character sets."
A complete listing of the Unicode® codes preferred for this invention, and especially preferred for encoding Japanese, Chinese and Korean terms for this invention, can be found in The Unicode Consortium, "The Unicode Standard: Worldwide Character Encoding, Version 1.0, Vols. 1 and 2", Addison-Wesley, Reading, Mass., 1992.
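A term held as 16-bit wide characters might be represented in C++ as in the following sketch. It is offered as an illustration under stated assumptions: the names `WideRecord` and `exampleWideRecord` are hypothetical, `std::u16string` is used here as one portable way to hold 16-bit code units, and the term romanized as "soshiki" is assumed to be written with the two characters U+7D44 and U+7E54.

```cpp
#include <cassert>
#include <string>

// Hypothetical record whose first term is held as 16-bit code units.
struct WideRecord {
    std::u16string firstTerm;  // e.g. a Japanese term
    std::string secondTerm;    // e.g. its English translation
    std::string topicalCode;   // e.g. an IPC code
};

// Example record, assuming "soshiki" is written with the two characters
// U+7D44 and U+7E54.
WideRecord exampleWideRecord() {
    WideRecord rec{u"\u7D44\u7E54", "tissue", "A61K 47/38"};
    assert(rec.firstTerm.size() == 2);  // two 16-bit code units
    return rec;
}
```

Storing the source term in this form lets it be compared directly with wide-character text scanned from a document, without relying on romanization.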