The present invention relates to learning relationships among words. More specifically, the present invention relates to a method of training a machine translator using bilingual text.
Machine translation is a process utilizing computer software and components to translate text from one language, such as German, French, or Japanese, into a second language, such as English, Spanish, or Arabic. Machine translation is anything but a straightforward process. It is not simply a matter of substituting one word for another, but is based upon knowing all of the words that comprise the given text, and how one word in the text influences other words in the text. However, human languages are complex and have many characteristics, such as morphology, syntax or sentence structure, semantics, ambiguities and irregularities. In order to translate between two languages, a machine translator must account for the grammatical structure of each of the languages. Further, it must use rules and assumptions to transfer the grammatical structure of the first (source) language into the second (target) language.
However, given the complexities involved in languages, machine translation tends to be only between 30% and 65% accurate. Many phrases and colloquial terms do not translate easily. Names of places and people, scientific words, and the like are often translated when they should be left untranslated. Rules that are hard-coded for certain grammatical features may be applied in every case, even though many exceptions to the rules exist, because writing code for all the exceptions would be a prolonged task and would result in a slow translation process. As a result, a document translated by current machine translation techniques may not even be understandable to a user; worse yet, some important elements of the document may be translated incorrectly.
Machine translators are only as good as the training data used to train the system. Machine translators are usually trained using human authored translations. These translations are fed through a training architecture that identifies various pairs of words that are related. These word pairs are often translations of one another, but sometimes the paired words are not exact translations. Other machine translators are trained using data from a bilingual dictionary. However, training from these types of translations is not always the best way to train a machine translator, as the translations can lead the translator to choose the wrong word in a given circumstance.
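By way of illustration only, the identification of related word pairs from human authored translations can be sketched as follows. The background above does not specify how a training architecture scores word pairs; this sketch assumes a simple co-occurrence measure (the Dice coefficient) over sentence-aligned text, and the function name `related_word_pairs`, the `min_score` threshold, and the three-sentence corpus are all hypothetical.

```python
from collections import Counter
from itertools import product


def related_word_pairs(sentence_pairs, min_score=0.5):
    """Score (source word, target word) pairs by the Dice coefficient.

    Assumption: the input is a list of (source sentence, target sentence)
    tuples that are already aligned. The Dice coefficient is an
    illustrative choice, not a method taken from the text.
    """
    src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        # Count each word at most once per sentence.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_counts.update(src_words)
        tgt_counts.update(tgt_words)
        # Every source word co-occurs with every target word in the pair.
        pair_counts.update(product(src_words, tgt_words))

    scores = {}
    for (s, t), both in pair_counts.items():
        dice = 2 * both / (src_counts[s] + tgt_counts[t])
        if dice >= min_score:
            scores[(s, t)] = dice
    return scores


# Hypothetical sentence-aligned English/French training data.
corpus = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]
pairs = related_word_pairs(corpus, min_score=0.9)
```

With this toy corpus, "house" and "maison" always appear together and are kept, while "house" and "la" co-occur only once and fall below the threshold. When the human translation is loose, however, the words that co-occur are not exact translations of each other, and a measure of this kind will learn those pairings as well.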
One problem with using human authored translations to train a machine translator is that the translations are often not translations in the true sense of the word, but are more like interpretations of the text. For example, in Canada, parliamentary debates provide a ready source of human authored translated data that can be used to train a machine translator. However, these translations are often not true translations. Hence, they do not provide training data of the quality necessary for the machine translator to generate accurate translations.
The accuracy problem with machine translation can be explained by a simple example. Using presently available machine translation, if a user were to translate a sentence from English to French, a certain degree of inaccuracy would be involved. In translating the sentence back to English using machine translation, the original translation inaccuracy is amplified, and the sentence will in most instances be different from the original English sentence. Take, for example, the following statement from a Canadian debate:
Mr. Hermanson: On a point of order, Mr. Speaker, I think you will find unanimous consent to allow the leader of the Reform Party, the hon. member for Calgary Southwest, to lead off this debate, and the hon. member for Red Deer would then speak in his normal turn in the rotation.
This was translated by a human translator into French as:
M. Hermanson: J'invoque le Règlement, monsieur le Président. Je pense que vous trouverez qu'il y a consentement unanime pour que le chef du Parti réformiste, le député de Calgary-Sud-Ouest, engage ce débat et que le député de Red Deer prenne ensuite la parole quand ce sera son tour.
This translates back to English as:
I call upon the requirement, Mr. President. I think that you will find that there is a unanimous consent to the proposition that the head of the reformist party, the member from Calgary-Southwest start this debate, and that the member from Red Deer makes his statement when it is his turn.
However, when translated back to English using a machine translator, it becomes:
I call upon the Payment, Mr. President Président. I think that you will find that there is unanimous assent so that the chief of the Party reformist, the deputy of Calgary-South-West, engages this debate and that the deputy of Red Deer speaks then when it is its turn.
As can be seen from the above example, the quality of a machine translation leaves much to be desired. The reliance on human authored translations tends to make the machine translator dependent upon interpretations, as opposed to translations, when learning the relationships between words. Also, only a limited amount of material is available for use as training data (e.g., Bibles, debates at bilingual or multilingual organizations, and other documents that are specifically created in a bilingual format). Further, generating more translated documents to use for training a machine translator is an expensive process that still does not provide enough accuracy to effectively train the machine translator. Therefore, it is desirable to train a machine translator with a large amount of translated data at a minimum of cost, while preserving or enhancing the accuracy of the machine translator.