The present invention relates to automatic translation systems. In particular, the present invention relates to translation identification using non-parallel corpora.
In translation systems, a string of characters in one language is converted into a string of characters in another language. One challenge to such translation systems is that it is difficult to construct a dictionary that can provide a translation for every word in the source and target languages. One reason for this is the sheer number of words in the languages, which makes creating such a dictionary labor intensive. Another reason is that new words are constantly being added to the languages, so that a large amount of work is required to keep the dictionary current. The lack of available translations is a particular problem for multi-word phrases, such as the noun phrases “information age” or “information asymmetry”, since there are a large number of such phrases and new phrases are continually being created.
To overcome the work involved in building and updating translation dictionaries, several systems have been created that automatically generate a translation dictionary. Under one set of systems, the translation dictionary is formed using parallel bilingual corpora. In such systems, the same information is written in two different languages. The text in one of the languages is aligned with the text in the other language, typically on a sentence-by-sentence basis. After this alignment is complete, comparisons between the aligned texts are made to identify words that are likely translations of each other.
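The comparison step over aligned text can be illustrated with a minimal sketch. It assumes that sentences are already aligned and whitespace-tokenized, and it scores word pairs with the Dice coefficient over sentence-level co-occurrence. The Dice measure, the function name `dice_scores`, and the toy Spanish-English sentences are assumptions chosen for illustration; the systems described above are not stated to use this particular measure.

```python
from collections import Counter

def dice_scores(aligned_pairs, source_word):
    """Score target words as translations of source_word (illustrative only).

    aligned_pairs: list of (source_sentence, target_sentence) strings that
    are assumed to be sentence-aligned already.
    """
    src_count = 0            # sentence pairs whose source side has source_word
    tgt_counts = Counter()   # sentence pairs containing each target word
    joint = Counter()        # sentence pairs containing both words
    for src_sentence, tgt_sentence in aligned_pairs:
        src_tokens = set(src_sentence.split())
        tgt_tokens = set(tgt_sentence.split())
        tgt_counts.update(tgt_tokens)
        if source_word in src_tokens:
            src_count += 1
            joint.update(tgt_tokens)
    # Dice coefficient: 2 * joint / (count of source word + count of target word)
    return {t: 2 * joint[t] / (src_count + tgt_counts[t]) for t in joint}

# Hypothetical toy corpus, not taken from any real data set.
pairs = [
    ("el gato duerme", "the cat sleeps"),
    ("el perro duerme", "the dog sleeps"),
    ("un gato come", "a cat eats"),
]
scores = dice_scores(pairs, "gato")
best = max(scores, key=scores.get)
```

In this toy corpus, "cat" is the only target word that appears in exactly the same sentence pairs as "gato", so it receives the highest score.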
Although using parallel corpora is an effective technique, obtaining such corpora is difficult in practice. To deal with this difficulty, some systems have been proposed that use non-parallel corpora. Under such systems, a set of candidate translations is assumed to be given or to be easily collected. The goal of the systems is to select the best candidate from this set.
To do this, the systems rely on a linguistic phenomenon in which the translation of a word appears in target language contexts that are the same as the contexts of the word in the source language. Thus, these systems identify the best candidate by translating the contexts in the source language into the target language and selecting the candidate translation whose target language context best matches the translated context. In one system, the contexts are represented by vectors in which each element represents a word in the context.
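A minimal sketch of this selection procedure follows. It assumes that each context is a bag of word counts, that context words are translated through a one-to-one dictionary, and that contexts are compared by cosine similarity. The similarity measure, all function names, and the toy data are assumptions made for illustration rather than details given above.

```python
import math
from collections import Counter

def translate_context(source_context, dictionary):
    """Translate a source context vector word by word (one-to-one dictionary)."""
    translated = Counter()
    for word, count in source_context.items():
        if word in dictionary:          # words without an entry are dropped
            translated[dictionary[word]] += count
    return translated

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_candidate(source_context, candidate_contexts, dictionary):
    """Pick the candidate whose target context best matches the translated context."""
    translated = translate_context(source_context, dictionary)
    return max(candidate_contexts,
               key=lambda c: cosine(translated, candidate_contexts[c]))

# Hypothetical toy data: a Spanish source context and two English candidates.
toy_dictionary = {"tecnología": "technology", "red": "network", "datos": "data"}
source_context = Counter({"tecnología": 3, "red": 2, "datos": 1})
candidate_contexts = {
    "information age": Counter({"technology": 4, "network": 2, "data": 1}),
    "information period": Counter({"school": 3, "lesson": 2, "data": 1}),
}
chosen = best_candidate(source_context, candidate_contexts, toy_dictionary)
```

With these toy counts, the translated context overlaps almost entirely with the context of "information age", so that candidate is selected.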
One problem with such systems is that they depend on an accurate translation of the contexts. Many systems assume that there is a one-to-one mapping between context words in the source language and context words in the target language, so that an accurate translation can be achieved simply by consulting a translation dictionary. In reality, however, there is a many-to-many relationship between words in a source language and words in a target language. As a result, each word in the source context can have multiple translations in the target language. In addition, words in the target context can have several different translations in the source language.
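The ambiguity can be made concrete with a small sketch that enumerates every way a source context could be translated when the dictionary lists several translations per word. The dictionary entries and the function name `possible_translations` are hypothetical and serve only to illustrate the problem, not any particular solution to it.

```python
from itertools import product

# Hypothetical one-to-many dictionary (Spanish to English).
toy_dictionary = {
    "banco": ["bank", "bench"],
    "red": ["network", "net"],
    "datos": ["data"],
}

def possible_translations(context_words, dictionary):
    """List every combination of per-word translations for a context."""
    options = [dictionary[w] for w in context_words if w in dictionary]
    return [list(combo) for combo in product(*options)]

context = ["banco", "red", "datos"]
translations = possible_translations(context, toy_dictionary)
```

Even this three-word context yields four distinct translated contexts, and the count grows multiplicatively with each additional ambiguous word, so a single dictionary lookup cannot be relied on to produce the correct one.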
Thus, a system is needed that provides for accurate translations of the contexts while taking into account the many-to-many relationship between words in the source and target languages.
In addition, since all automatic translation dictionary systems are prone to error, it is desirable to develop a system that limits the number of incorrect translations that are entered into the dictionary.