Speech-to-speech (S2S) translation systems use several components to receive spoken input in a source language and provide synthesized audible output in a target language. The main components typically include an automatic speech recognition (ASR) engine to convert audio input into text-based output in the source language, a machine translation (MT) engine to translate the source language text output by the ASR engine into text-based output in the target language, and, in some cases, a text-to-speech (TTS) engine to convert the target language text output by the MT engine into synthesized audio output in the target language.
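By way of illustration only, the chaining of these components may be sketched as follows. This is a minimal sketch, not an implementation of any particular system: the names `AsrEngine`, `MtEngine`, `TtsEngine`, `S2SResult`, and `translate_speech`, as well as the toy stand-in engines, are hypothetical and introduced here solely to show how the output of each engine becomes the input of the next, with the TTS stage being optional.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical type aliases (assumptions for illustration); a real engine
# would wrap a trained model rather than a plain function.
AsrEngine = Callable[[bytes], str]   # source-language audio -> source-language text
MtEngine = Callable[[str], str]      # source-language text -> target-language text
TtsEngine = Callable[[str], bytes]   # target-language text -> target-language audio


@dataclass
class S2SResult:
    source_text: str               # text produced by the ASR engine
    target_text: str               # text produced by the MT engine
    target_audio: Optional[bytes]  # None when no TTS engine is configured


def translate_speech(
    audio: bytes,
    asr: AsrEngine,
    mt: MtEngine,
    tts: Optional[TtsEngine] = None,
) -> S2SResult:
    """Run the ASR -> MT -> (optional) TTS chain on one utterance."""
    source_text = asr(audio)
    target_text = mt(source_text)
    target_audio = tts(target_text) if tts is not None else None
    return S2SResult(source_text, target_text, target_audio)


# Toy stand-ins so the sketch runs end to end; they perform no real
# recognition, translation, or synthesis.
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_mt(text: str) -> str:
    lexicon = {"hello": "hola", "world": "mundo"}
    return " ".join(lexicon.get(word, word) for word in text.split())


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    result = translate_speech(b"hello world", toy_asr, toy_mt, toy_tts)
    print(result.source_text, "->", result.target_text)
```

Because the MT engine in such a chain consumes whatever text the ASR engine emits, any difference in style between that text and the text the MT engine was trained on carries directly into the translated output.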
In an S2S translation system, the translation quality of the MT engine depends on the data used to train it (i.e., training data). Current translation systems implement MT engines that are trained on highly edited text corpora, which are not suitable for translating spontaneous speech. That is, the text output by the ASR engine is typically conversational and disfluent, whereas the edited, written training data used to train the MT engine is typically formal and fluent. This leads to a significant mismatch between the output of the ASR engine and the input expected by the MT engine, thus hindering the MT engine's ability to output an accurate translation for a given utterance received by the ASR engine. This, in turn, results in poor translations being output in the target language to an end user. Moreover, there are few corpora of spontaneous speech paired with text translations in a target language that could otherwise be used for training an MT engine on ASR output, and it is impractical to manually create a sufficient volume of such training data. Thus, MT engines remain poorly trained for implementation within an S2S translation system.