1. Field of the Invention
The present invention generally relates to a method and apparatus for building a statistical machine translation (SMT) system, and more specifically to an SMT system for data collection and acquisition that utilizes a translation game played by multilingual people.
2. Description of the Related Art
Conventional statistical machine translation (SMT) systems rely on manually translated bilingual data, in which a given sentence or phrase in the source language is translated into a target language. Producing these translated sentence pairs is the most time-consuming part of building a conventional SMT system, because it depends on human labor. At best, only a few human translators are available to translate large quantities of data, so a lack of translators can become a bottleneck in translation data collection. In addition, for some exotic languages, it is conventionally difficult to find bilingual speakers because few are available.
In conventional phrase-based statistical machine translation (SMT) systems, estimates of conditional phrase-translation probabilities are the major source of translation knowledge. Phrase pair extraction is based on an automatically word-aligned corpus of bilingual sentence pairs. In such systems, every possible phrase pair up to a pre-defined phrase length is extracted, subject to the following constraints: a phrase pair must contain at least one pair of linked words, and it must not contain any word that is linked to a word outside the phrase pair.
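By way of illustration only, the two extraction constraints described above can be sketched in Python as follows. This is a minimal brute-force sketch, not the implementation of any particular SMT system. The function names `extract_phrase_pairs` and `phrase_translation_probs`, the `max_len` parameter, and the representation of word links as (source index, target index) pairs are assumptions made for this example. The use of relative frequencies to estimate the conditional phrase-translation probabilities is likewise an assumption; it is one common choice, and the text above does not specify an estimator.

```python
from collections import Counter

def extract_phrase_pairs(src, tgt, links, max_len=3):
    """Enumerate phrase pairs that satisfy both constraints.

    src, tgt : lists of words for one aligned sentence pair
    links    : set of (source index, target index) word links (assumed format)
    max_len  : pre-defined maximum phrase length (hypothetical parameter)
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            for j1 in range(len(tgt)):
                for j2 in range(j1, min(j1 + max_len, len(tgt))):
                    inside = 0
                    consistent = True
                    for s, t in links:
                        s_in = i1 <= s <= i2
                        t_in = j1 <= t <= j2
                        if s_in and t_in:
                            inside += 1          # link fully inside the pair
                        elif s_in or t_in:
                            consistent = False   # link leaves the pair
                            break
                    # Constraint 1: at least one link inside the pair.
                    # Constraint 2: no link crossing the pair boundary.
                    if consistent and inside > 0:
                        pairs.append((" ".join(src[i1:i2 + 1]),
                                      " ".join(tgt[j1:j2 + 1])))
    return pairs

def phrase_translation_probs(all_pairs):
    """Relative-frequency estimate of p(target phrase | source phrase).

    Assumption: count(source, target) / count(source).
    """
    pair_counts = Counter(all_pairs)
    src_counts = Counter(s for s, _ in all_pairs)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}

# Toy sentence pair with a one-to-one word alignment (illustrative data only).
pairs = extract_phrase_pairs(["das", "Haus"], ["the", "house"],
                             {(0, 0), (1, 1)}, max_len=2)
probs = phrase_translation_probs(pairs)
```

For the toy sentence pair, the candidate pair ("das", "the house") is rejected because "house" is linked to "Haus", which lies outside the candidate. A practical system would typically use a more efficient enumeration than this exhaustive search over all spans.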
The practical issues in developing a parallel corpus include: 1) a lack of experienced bilingual speakers in the language pair of interest, 2) the costs associated with translating each sentence, and 3) the time required to translate these sentences. These issues have a major impact on the development cycle of conventional SMT systems.