Machine translation refers to the utilization of a computing device to translate text or speech from a source natural language to a target natural language. Due to complexities in natural languages, however, executing a translation between natural languages is a non-trivial task.
Generally, machine translation systems are learned systems that are trained through utilization of training data. Pursuant to an example, labeled training data can include word sequence pairs that are either translations of one another or are not translations of one another. These word sequence pairs are labeled and provided to a learning system that learns a model based at least in part upon this labeled training data. Thus, machine translation systems are trained in such a way that they can translate a word sequence not included in the training data by observing how translation between a source language and target language works in practice with respect to various word sequences in the training data.
It can be ascertained that acquiring more training data that can be utilized to train a machine translation system causes translations output by a machine translation system to be increasingly accurate. Several languages have a significant amount of training data associated therewith. Thus, many conventional machine translation systems are quite adept at translating between, for example, English and Spanish. For other languages, however, there is a lack of training data that can be utilized to train a machine translation system that is desirably configured to translate between such languages. In an example, a lack of training data exists that allows for machine translation systems to efficiently translate between German and Bulgarian, for example.
One manner for obtaining this training data is to have individuals that can speak both German and Bulgarian, for example, manually label word sequences (e.g., as being parallel to one another or not parallel to one another). This labeled data may then be used to train a machine translation system that is configured to translate between the German and Bulgarian languages, for instance. Manually labeling training data, however, is a relatively monumental task, particularly if a great amount of training data is desired.