The predominant approach in speech translation systems is a direct cascade of separate automatic speech recognition (ASR) and machine translation (MT) components, in which the single ASR output, a word string, is passed to the MT system for translation. Spoken utterance translation (SUT), the task of automatically converting speech input in one language into text output in another language, is challenging for machines. A straightforward way to address this challenge is to build a two-stage system that combines state-of-the-art techniques from ASR and statistical MT (SMT). In this two-stage system, the ASR engine first recognizes the speech input and outputs its recognition hypotheses as text. The SMT engine then takes the recognition hypotheses as input and outputs translated text in a specified target language. The drawback of this design is that errors made by the ASR stage cannot be recovered in the MT stage.
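The two-stage cascade described above, and the way a recognition error carries through it, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the utterance ID `utt1`, the scored hypothesis list, the word-for-word English-to-Spanish lexicon, and the function names `recognize`, `translate`, and `cascade` are hypothetical stand-ins, not a real ASR or SMT engine.

```python
from typing import Dict, List, Tuple

# Hypothetical ASR output: utterance ID -> scored text hypotheses.
# Assume the speaker actually said "i need four tickets".
TOY_ASR_OUTPUT: Dict[str, List[Tuple[str, float]]] = {
    "utt1": [
        ("i need for tickets", 0.6),   # top-scoring hypothesis, but wrong
        ("i need four tickets", 0.4),  # correct, but ranked second
    ],
}

# Hypothetical word-for-word lexicon standing in for an SMT engine.
TOY_LEXICON: Dict[str, str] = {
    "i": "yo",
    "need": "necesito",
    "four": "cuatro",
    "for": "para",
    "tickets": "boletos",
}


def recognize(utt_id: str) -> List[Tuple[str, float]]:
    """Stage 1 (toy ASR): return hypotheses sorted by descending score."""
    return sorted(TOY_ASR_OUTPUT[utt_id], key=lambda h: h[1], reverse=True)


def translate(text: str) -> str:
    """Stage 2 (toy SMT): translate word by word; unknown words pass through."""
    return " ".join(TOY_LEXICON.get(word, word) for word in text.split())


def cascade(utt_id: str) -> str:
    """Direct cascade: only the single top ASR hypothesis reaches stage 2."""
    best_text, _score = recognize(utt_id)[0]
    return translate(best_text)


if __name__ == "__main__":
    print("cascade output: ", cascade("utt1"))
    print("intended output:", translate("i need four tickets"))
```

In this toy setup the second-ranked hypothesis would have produced the intended translation, but the cascade discards it before the translation stage runs, so the translation stage has no opportunity to correct the recognition error.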