The invention relates to techniques that provide information about expressions in one natural language, with the information being understandable in another natural language.
Various techniques have been proposed for providing information about expressions in a first natural language, with the information being understandable in a second natural language. For people who understand the second language but not the first, such information can produce an understanding of expressions in the first language. Some such techniques attempt to perform automatic translation, while other techniques provide machine aids for translation. Yet other techniques provide information without attempting translation.
Bauer, D., Segond, F., and Zaenen, A., "LOCOLEX: the translation rolls off your tongue", in ACH-ALLC '95 Conference Abstracts, Santa Barbara, Calif., Jul. 11-15, 1995, pp. 6-9, describe LOCOLEX, an intelligent reading aid that provides bilingual dictionary lookup through the interaction between a complete on-line dictionary and an on-line text. The user can click on a word in a sentence, and LOCOLEX uses the word's context to look for multiword expressions (MWEs) that include the word, to choose between parts of speech for the word based on parts of speech of neighboring words, and to exclude irrelevant information from the dictionary in order to focus the user's attention on the best translation for better comprehension. The user can ask for more information about one meaning by clicking on it to get a usage example, one of the types of dictionary information that is not initially displayed. The lookup process after the user selects a word includes tokenization, normalization of each word to a standard form, morphological analysis, part of speech disambiguation, dictionary lookup, identification of MWEs, and elimination of irrelevant parts of the dictionary for display. To improve recognition of MWEs, they are encoded as regular expressions in a two-level rule formalism, and the rules are inserted into the relevant dictionary entries in place of the existing static text giving the normal forms of the MWEs. LOCOLEX could also make better use of the usage labels and indicators in the dictionary to filter out semantically inappropriate meanings of a word in a given context, such as by interactively asking the user to choose topics and then displaying translations associated with the chosen topics.
A prototype LOCOLEX system provides a user interface in which a user can click on a word about which the user desires information. In response, the system operates as described above to obtain and display a part of a dictionary entry relating to the word's part of speech or to an MWE that includes the word. If the user feels the presented information is inappropriate, the user can click on the word again, and the system responds by displaying the complete dictionary entry.
U.S. Pat. No. 5,642,522 describes a context-sensitive technique for finding information about a word in an electronic dictionary. The technique maps the selected word from its inflected form to its citation form, analyzes the selected word in the context of neighboring and surrounding words to resolve ambiguities, and displays the information that is determined to be the most likely to be relevant. The user can request additional information, in which case either the next most relevant information or all information about the selected word is provided. The dictionary can include information about multi-word combinations that include the selected word, and content determination can include checking whether the selected word is part of a predefined multi-word combination. The technique could be used with a thesaurus or a dictionary, including a dictionary used for translation.
The invention addresses problems that arise with conventional techniques for providing information about expressions in one language, where the information is understandable in another language. The problems relate to complexity of sentences and other expressions whose meanings can depend on relationships, not only between individual tokens such as words, but also between groups of tokens. For example, the meaning of a sentence may depend on the relationship between a noun phrase and a verb or predicate phrase. Sentences and other expressions whose meaning depends on relationships between two or more groups of tokens are referred to herein as "multi-token expressions".
Because of the various meanings an expression may have, a sentence that includes more than a few component expressions will have a very large number of possible translations, making it computationally difficult to consider all possible translations of the sentence. On the other hand, most sentences are neither identical nor nearly identical to a previously translated sentence, making it impractical in most situations to reuse previous translations of sentences. Conventional techniques thus tend to fall into three groups:
The first group includes automatic translation techniques, which attempt to translate sentences despite the computational difficulties of considering all possible translations. The techniques in the first group frequently make serious errors in translation, especially when applied to sentences that include expressions with multiple possible translations. The techniques in the first group typically discard information about ambiguities in order to obtain a translated result, making it impossible to recover from an error.
The second group includes machine aided translation tools and other techniques that find sentences that match or nearly match previously translated sentences, then retrieve the previous translation. Techniques in the second group are especially useful in the cases in which identical or nearly identical sentences occur frequently, but do not provide assistance the first time a new sentence occurs. Therefore, techniques in the second group are only useful in restricted contexts in which the chances of finding repetitions of similar sentences are reasonably high.
The third group of techniques does not attempt to automatically translate complete sentences. An example of this is LOCOLEX, described above, which displays a dictionary entry with one or more definitions of a selected expression, a technique analogous to an ordinary translating dictionary. Techniques in the third group can be quite helpful in aiding comprehension, but they have severe limitations. In LOCOLEX, for example, the user must have some knowledge of the first language in order to make appropriate selections. In some cases, a dictionary entry does not provide sufficient information to allow the user to determine the meaning of a word or multi-word expression (MWE) in a specific context, much less the meaning of a sentence that includes the word or MWE.
The invention alleviates problems resulting from complexity of multi-token expressions, referred to herein as "complexity problems". The invention does so based on the surprising discovery that the meaning of a multi-token expression in a first language is often indicated by appropriately chosen translations of subexpressions into a second language, without attempting a complete, accurate translation of the multi-token expression.
The invention alleviates the complexity problem by providing techniques that rank a subexpression's translation choices. The techniques use the ranked translation choices to produce a sequence of translation choices, to indicate in the second language the meaning of the multi-token expression.
The techniques also make it possible to consider a subset of possible translations of a multi-token expression, the subset that includes the translation choices that are likely for each subexpression. In other words, information can be automatically or interactively presented indicating more than one sequence of translation choices for the multi-token expression, allowing the user to find one which seems most likely to indicate the meaning of the multi-token expression as a whole. This is especially useful if the sequence that includes the highest ranked translation choice for each subexpression does not indicate a plausible meaning for the multi-token expression. The user can, in effect, build another sequence that seems more likely to indicate the meaning even though it includes translation choices that are less highly ranked. These choices can also be propagated through a text, consistently changing the translation for any particular token or multi-token subexpression. They may also be captured to enable the system to provide better choices for future texts.
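By way of illustration only, considering a subset of the possible translations can be sketched in Python as follows. The sketch assumes that each subexpression already has a ranked list of translation choices; the function name `candidate_sequences`, the cutoff `k`, and the rule of ordering sequences by the sum of rank positions are assumptions made for this example and are not taken from the description.

```python
from itertools import product


def candidate_sequences(ranked, k=2):
    """Enumerate sequences built from the top-k choices of each subexpression.

    `ranked` is a list with one entry per subexpression; each entry is a list
    of translation choices, best first. Sequences are ordered by the sum of
    the rank positions they use, so the all-top-ranked sequence comes first.
    """
    # Pair every retained choice with its rank position (0 = best).
    pools = [list(enumerate(choices[:k])) for choices in ranked]
    # sorted() is stable, so ties keep the order produced by product().
    combos = sorted(product(*pools),
                    key=lambda combo: sum(rank for rank, _ in combo))
    return [[text for _, text in combo] for combo in combos]


# Illustrative ranked choices for a three-token French sentence.
ranked = [["the"], ["cat", "chat"], ["sleeps", "lies dormant"]]
for seq in candidate_sequences(ranked):
    print(" ".join(seq))
```

Limiting each subexpression to its `k` most likely choices keeps the number of sequences at most `k` raised to the number of subexpressions, rather than the full product of all dictionary senses.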
The techniques can be implemented in a method that obtains subexpressions of a multi-token expression in the first language. The method can obtain translation choices in the second language for a set of the subexpressions. Two or more translation choices can be for one subexpression. The method can rank a subset of the subexpression's translation choices, and can then use the ranked translation choices in producing a sequence of translation choices for the multi-token expression as a whole. The method can present information about the sequence to a user, indicating in the second language the meaning of the multi-token expression.
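The method just summarized can be sketched in Python under simplifying assumptions. The toy French-to-English lexicon, the numeric scores, and the names `Choice`, `get_subexpressions`, `ranked_choices`, and `gloss` are all hypothetical and serve only to make the steps concrete; whitespace splitting stands in for whatever process actually obtains subexpressions.

```python
from dataclasses import dataclass


@dataclass
class Choice:
    text: str     # translation choice in the second language
    score: float  # hypothetical ranking score; higher ranks first


# Hypothetical toy lexicon (illustrative data only).
LEXICON = {
    "le": [Choice("him", 0.3), Choice("the", 0.9)],
    "chat": [Choice("cat", 0.95), Choice("chat", 0.2)],
    "dort": [Choice("lies dormant", 0.4), Choice("sleeps", 0.9)],
}


def get_subexpressions(expression):
    """Obtain subexpressions; here simply lowercase whitespace tokens."""
    return expression.lower().split()


def ranked_choices(subexpression):
    """Obtain and rank translation choices for one subexpression."""
    # Unknown subexpressions pass through untranslated with the lowest score.
    choices = LEXICON.get(subexpression, [Choice(subexpression, 0.0)])
    return sorted(choices, key=lambda c: c.score, reverse=True)


def gloss(expression):
    """Return the top-ranked sequence and the full ranked choices."""
    ranked = [ranked_choices(s) for s in get_subexpressions(expression)]
    sequence = [choices[0].text for choices in ranked]
    return sequence, ranked


sequence, ranked = gloss("Le chat dort")
print(" ".join(sequence))
```

The ranked lists are returned alongside the sequence so that lower-ranked choices remain available for presentation rather than being discarded.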
In obtaining subexpressions, the method can tokenize the multi-token expression, and the subexpressions can include the resulting tokens. In obtaining translation choices, the method can then obtain a normalized form for each token and can then obtain one or more translation choices for the normalized form of each token. If a token is a multi-word expression, then the method can use a multi-word expression lexicon to obtain translation choices. The method can obtain an inflection tag indicating a relation between a token and its normalized form and can use the inflection tag and normalized forms of translation choices to obtain surface forms of the translation choices, even for translation choices that are multi-word expressions.
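The tokenization, normalization, and inflection steps can be illustrated with the following Python sketch. All tables (`NORMAL_FORMS`, `LEXICON`, `MWE_LEXICON`, `PLURALS`) contain invented toy data, the inflection tags `PL` and `3PL` are hypothetical labels, and the rule of inflecting only the last word of a multi-word translation choice is an assumption made for the example.

```python
import re

# Hypothetical tables: surface token -> (normalized form, inflection tag).
NORMAL_FORMS = {
    "les": ("le", "PL"),
    "chats": ("chat", "PL"),
    "mangent": ("manger", "3PL"),
    "pommes de terre": ("pomme de terre", "PL"),
}
LEXICON = {"le": ["the"], "chat": ["cat"], "manger": ["eat"], "des": ["some"]}
MWE_LEXICON = {"pomme de terre": ["potato", "earth apple"]}
PLURALS = {"cat": "cats", "potato": "potatoes", "apple": "apples"}


def tokenize(text):
    """Split text into tokens, preferring the longest known multi-word token."""
    words = re.findall(r"\w+", text.lower())
    tokens, i = [], 0
    while i < len(words):
        for n in range(len(words) - i, 1, -1):  # candidate spans of 2+ words
            candidate = " ".join(words[i:i + n])
            if candidate in NORMAL_FORMS:
                tokens.append(candidate)
                i += n
                break
        else:  # no multi-word match: emit a single word
            tokens.append(words[i])
            i += 1
    return tokens


def normalize(token):
    """Return (normalized form, inflection tag); unknown tokens are unchanged."""
    return NORMAL_FORMS.get(token, (token, "NONE"))


def inflect(choice, tag):
    """Produce a surface form; only plural marking is modeled here."""
    if tag != "PL":
        return choice
    words = choice.split()
    # Assumption: the last word of a multi-word choice carries the inflection.
    words[-1] = PLURALS.get(words[-1], words[-1])
    return " ".join(words)


def surface_choices(token):
    """Translation choices for a token, inflected to match the token."""
    lemma, tag = normalize(token)
    lexicon = MWE_LEXICON if " " in lemma else LEXICON
    return [inflect(c, tag) for c in lexicon.get(lemma, [lemma])]


tokens = tokenize("Les chats mangent des pommes de terre")
print([surface_choices(t)[0] for t in tokens])
```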
In obtaining translation choices, the method can also obtain a set of alternative part-of-speech tags for each token, and use them to obtain a single part-of-speech tag for each token, such as by disambiguation. Then the method can use the single part-of-speech tags to group the tokens into token group subexpressions, analogous to phrases. The method can also obtain token group tags indicating types of token group subexpressions. The token group tags can be used to reorder translation choices within the portion of the sequence obtained for a multi-word expression or to reorder the portions of the sequence for two multi-word expressions.
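A minimal Python sketch of tagging, grouping, and reordering follows. The tag inventory, the two disambiguation rules, the group tags `NP` and `VP`, and the English noun-phrase order are all assumptions chosen to keep the example small; they are not drawn from the description, which leaves the disambiguation and grouping processes open.

```python
# Hypothetical alternative part-of-speech tags and one gloss per token.
ALT_TAGS = {"le": {"DET", "PRON"}, "chat": {"NOUN"},
            "noir": {"ADJ", "NOUN"}, "dort": {"VERB"}}
GLOSS = {"le": "the", "chat": "cat", "noir": "black", "dort": "sleeps"}
NP_ORDER = {"DET": 0, "ADJ": 1, "NOUN": 2}  # assumed target-language order


def disambiguate(tokens):
    """Reduce each token's alternative tags to a single tag (toy rules)."""
    tags = []
    for i, tok in enumerate(tokens):
        alts = ALT_TAGS[tok]
        if len(alts) == 1:
            tag = next(iter(alts))
        elif ("DET" in alts and i + 1 < len(tokens)
              and "NOUN" in ALT_TAGS[tokens[i + 1]]):
            tag = "DET"  # determiner reading before a possible noun
        elif "ADJ" in alts and tags and tags[-1] == "NOUN":
            tag = "ADJ"  # adjective reading directly after a noun
        else:
            tag = sorted(alts)[0]  # arbitrary but deterministic fallback
        tags.append(tag)
    return tags


def group(tokens, tags):
    """Group adjacent tokens into tagged token groups, analogous to phrases."""
    groups = []
    for tok, tag in zip(tokens, tags):
        gtag = "NP" if tag in NP_ORDER else "VP"
        if groups and groups[-1][0] == gtag:
            groups[-1][1].append((tok, tag))
        else:
            groups.append((gtag, [(tok, tag)]))
    return groups


def reorder(gtag, members):
    """Use the token group tag to reorder choices within the group."""
    if gtag == "NP":
        return sorted(members, key=lambda m: NP_ORDER[m[1]])
    return members


tokens = "le chat noir dort".split()
groups = group(tokens, disambiguate(tokens))
sequence = [GLOSS[tok] for gtag, members in groups
            for tok, _ in reorder(gtag, members)]
print(" ".join(sequence))
```

Here the French noun-adjective order in "chat noir" is rendered in the adjective-noun order a reader of English expects.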
Part-of-speech tags could also be used to rank translation choices. Translation choices could alternatively be ranked according to source, based on semantic context, or based on confidence levels that depend on how each translation choice was obtained. Ranking of translation choices could be automatic, or could be performed in accordance with a user signal selecting a ranking process.
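Selectable ranking processes can be sketched in Python as follows. The source names, the confidence values in `SOURCE_CONFIDENCE`, and the `process` parameter standing in for a user signal are hypothetical; the sketch shows only that different ranking processes can be applied to the same set of translation choices.

```python
from collections import namedtuple

# A translation choice with the source it came from and its part of speech.
Choice = namedtuple("Choice", "text source pos")

# Hypothetical confidence levels that depend on how a choice was obtained.
SOURCE_CONFIDENCE = {"domain_glossary": 0.9, "general_dictionary": 0.6,
                     "guess": 0.2}


def rank_by_confidence(choices):
    """Rank choices by the confidence level of their source, highest first."""
    return sorted(choices, key=lambda c: SOURCE_CONFIDENCE[c.source],
                  reverse=True)


def rank_by_pos(choices, pos):
    """Rank choices matching the token's part-of-speech tag first."""
    return sorted(choices, key=lambda c: c.pos != pos)


def rank(choices, process="confidence", pos=None):
    """Apply the ranking process named by `process` (e.g. a user selection)."""
    if process == "confidence":
        return rank_by_confidence(choices)
    if process == "pos":
        return rank_by_pos(choices, pos)
    raise ValueError(f"unknown ranking process: {process}")


# Illustrative choices for the French token "livre".
livre = [
    Choice("book", "general_dictionary", "NOUN"),
    Choice("pound", "domain_glossary", "NOUN"),
    Choice("delivers", "general_dictionary", "VERB"),
]
print([c.text for c in rank(livre)])
print([c.text for c in rank(livre, process="pos", pos="VERB")])
```

Because Python's `sorted` is stable, choices that tie under a given process keep their original relative order.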
The method can also search a translation memory for a previous translation of the multi-token expression, and can present information about the sequence of translation choices only if the search does not find a previous translation.
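The fallback from translation memory to a sequence of translation choices can be sketched as below. The sketch assumes an exact-match memory keyed on whitespace-normalized, lowercased text, which is a simplification; the names `present`, `memory`, and `gloss` are hypothetical, and `gloss` stands for any function that produces a sequence of translation choices.

```python
def present(expression, memory, gloss):
    """Return a previous translation if one exists, else a choice sequence.

    `memory` maps normalized source expressions to previous translations.
    `gloss` is called only when the search finds no previous translation.
    """
    key = " ".join(expression.lower().split())
    previous = memory.get(key)
    if previous is not None:
        return ("memory", previous)
    return ("sequence", gloss(expression))


# Illustrative data: one remembered sentence and a trivial stand-in gloss.
memory = {"le chat dort": "The cat is sleeping."}
word_gloss = {"le": "the", "chien": "dog", "dort": "sleeps"}


def toy_gloss(expression):
    return [word_gloss.get(w, w) for w in expression.lower().split()]


print(present("Le chat dort", memory, toy_gloss))
print(present("Le chien dort", memory, toy_gloss))
```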
The method can present the sequence on a display. In response to a user signal selecting the translation choice presented for a subexpression, the method can produce and present a modified version of the sequence, with a different translation choice for the subexpression. If a user signal indicates a modification of the ranking of a subexpression's translation choices, and if the multi-token expression is an input text that includes a similar subexpression, the method can also modify the ranking of translation choices of the similar subexpression in accordance with the user signal. Where the method will be performed on a subsequent multi-word expression that includes a similar subexpression, the method can obtain a descriptor of the modification for use in modifying the ranking of the similar subexpression. The descriptor could, for example, indicate a rule applicable to the subexpression when it occurs in a context similar to its context in the multi-token expression. Or, where ranking is based on confidence levels, the descriptor could indicate a weighting of confidence levels of translation choices.
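The interactive behavior can be sketched in Python under the assumption that "similar" subexpressions are simply identical tokens and that a descriptor is a small dictionary recording the preferred choice. The class `GlossSession` and its methods are hypothetical; context-sensitive rules and confidence weightings are not modeled.

```python
class GlossSession:
    """Holds a text's tokens and one ranked choice list per distinct token."""

    def __init__(self, tokens, ranked):
        self.tokens = list(tokens)
        # Copy the lists so that one session's edits do not alter the input.
        self.ranked = {tok: list(choices) for tok, choices in ranked.items()}

    def sequence(self):
        """Current sequence: the top-ranked choice for every token."""
        return [self.ranked[tok][0] for tok in self.tokens]

    def _promote(self, token, choice):
        choices = self.ranked[token]
        choices.remove(choice)
        choices.insert(0, choice)

    def select(self, position, choice):
        """Handle a user signal choosing `choice` for the token at `position`.

        Rankings are keyed by token, so the change propagates to every
        identical token in the text. Returns a descriptor for later reuse.
        """
        token = self.tokens[position]
        self._promote(token, choice)
        return {"subexpression": token, "preferred": choice}

    def apply(self, descriptor):
        """Apply a descriptor captured from an earlier text, if relevant."""
        token, choice = descriptor["subexpression"], descriptor["preferred"]
        if choice in self.ranked.get(token, []):
            self._promote(token, choice)


ranked = {"la": ["the"], "livre": ["book", "pound"], "et": ["and"],
          "monte": ["rises"]}
first = GlossSession("la livre et la livre".split(), ranked)
descriptor = first.select(1, "pound")
print(" ".join(first.sequence()))

later = GlossSession("la livre monte".split(), ranked)
later.apply(descriptor)
print(" ".join(later.sequence()))
```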
The method can print the sequence of translation choices. The printed sequence can include the ranked set of translation choices for a subexpression.
The techniques can also be implemented in a machine with a processor that can access the multi-token expression and can provide information to user output circuitry for presentation to users. The processor can operate to obtain subexpressions, obtain translation choices, rank translation choices, produce a sequence of translation choices, and provide information to the user output circuitry, as described above.
The techniques can also be implemented in an article of manufacture for use in a system that includes a multi-token expression, a storage medium access device, a processor, and user output circuitry. The article can include a storage medium and instruction data stored by the storage medium. The processor can receive the instruction data from the storage medium access device and, in executing the instructions, can operate as above.
The techniques can also be implemented in a method of operating a first machine to transfer instruction data as described above to the memory of a second machine that also includes user output circuitry and a processor. The processor of the second machine, in executing the instructions, can similarly operate as above.
In comparison with conventional machine translation techniques, the techniques provided by the invention are advantageous because they avoid the computational difficulties involved in translating entire sentences and other multi-token expressions. In addition, the inventive techniques can be extended to preserve sufficient ambiguity by saving two or more translation choices, and can therefore present two or more choices or allow a user to build a more likely translation of a multi-token expression using alternative translation choices.
In comparison with LOCOLEX and similar techniques, the techniques provided by the invention are advantageous because they can handle an entire multi-token expression rather than only a word or short multi-word expression, providing a sequence of translation choices that indicates the meaning of the multi-token expression. In addition, the inventive techniques can be extended to inflect or reorder translation choices, increasing the readability of a presented sequence of translation choices. The inventive techniques are not limited to obtaining translations from dictionaries, but can also be combined with conventional translation memory techniques to provide an alternative when a previous translation cannot be found. It may be possible to extend the inventive techniques to enable a person who is unfamiliar with a source language but has a strong facility in a target language to produce a reasonable translation of an input text in the target language.
The techniques provided by the invention are also advantageous because they can be extended to capture information about user choices and so be able to continually improve overall system performance. The techniques can also be extended to identify the domain in which user choices are made, allowing domain specific corrections.
The following description, the drawings, and the claims further set forth these and other aspects, objects, features, and advantages of the invention.