1. Field of the Invention
The present invention relates to speech and audio processing and, more specifically, to spoken language translation.
2. Introduction
Current speech-to-speech translation approaches predominantly rely on a pipeline model consisting of several black box steps. One sample application of speech-to-speech translation is a mobile device into which a user speaks in one language, such as English, and which translates the speech into another spoken language, such as Korean. The first step of the traditional speech-to-speech translation approach is to transcribe the source language speech into text using an automatic speech recognition (ASR) system. Typically, only the top-best ASR hypothesis text is considered for machine translation. The second step is to translate the text via machine translation. The third step is to synthesize the translated text into speech in the target language. Such an approach discards the rich information contained in the source speech signal that may be vital for meaningful communication. It is well known that prosodic and affective aspects of speech are highly correlated with the communicative intents of the speaker and often complement the information present in the lexical stream. Disregarding such information often results in ambiguous concept transfer in translation, which is a significant problem in the art. For example, even the best of current speech translation approaches may chunk an utterance improperly, erroneously emphasizing a word or phrase in the target language. In other cases, key contextual information such as word prominence, emphasis, and contrast can be lost in the speech-to-text conversion.
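By way of illustration only, the three-step pipeline described above can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular system: the SourceUtterance container, the function names, and the toy translation and synthesis stages are all hypothetical stand-ins. The sketch shows that only the top-best hypothesis text passes from one step to the next, so two utterances that differ only in which word the speaker emphasized yield the same output.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SourceUtterance:
    # Hypothetical stand-in for an analyzed source-language utterance.
    nbest: List[Tuple[str, float]]  # ASR hypotheses as (text, score) pairs
    prominent_words: List[str]      # words the speaker emphasized (prosodic cue)


def recognize_top_best(utt: SourceUtterance) -> str:
    # Step 1: keep only the text of the highest-scoring hypothesis.
    return max(utt.nbest, key=lambda hyp: hyp[1])[0]


def pipeline_translate(
    utt: SourceUtterance,
    translate: Callable[[str], str],
    synthesize: Callable[[str], str],
) -> str:
    source_text = recognize_top_best(utt)  # prosodic cues are dropped here
    target_text = translate(source_text)   # Step 2: text-to-text translation
    return synthesize(target_text)         # Step 3: text-to-speech synthesis


def toy_translate(text: str) -> str:
    # Placeholder translation: tags each word instead of translating it.
    return " ".join(f"tgt({word})" for word in text.split())


def toy_synthesize(text: str) -> str:
    # Placeholder synthesis: returns a label instead of audio samples.
    return f"<audio:{text}>"


hypotheses = [("i want the red one", 0.9), ("i want the bread one", 0.4)]
utt_a = SourceUtterance(nbest=hypotheses, prominent_words=["red"])
utt_b = SourceUtterance(nbest=hypotheses, prominent_words=["i"])

out_a = pipeline_translate(utt_a, toy_translate, toy_synthesize)
out_b = pipeline_translate(utt_b, toy_translate, toy_synthesize)
print(out_a == out_b)  # the difference in emphasis does not reach the output
```

In this sketch the prominent_words field is never read by any step, which mirrors the loss of word prominence, emphasis, and contrast in the speech-to-text conversion.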
Prosodic information has been used in speech translation, but mainly for utterance segmentation and disambiguation. The VERBMOBIL speech-to-speech translation system utilizes prosody through the use of clause boundaries, accentuation, and sentence mood to improve the linguistic analysis within the speech understanding component. The use of clause boundaries improves the decoding speed and disambiguation during translation. More recently, P. D. Aguero, J. Adell, and A. Bonafonte proposed a framework for generating target language intonation as a function of source utterance intonation. They use an unsupervised algorithm to find intonation clusters in the source speech that are similar to those in the target speech. However, such a scheme assumes some notion of prosodic isomorphism at either the word level or the accent group level.
Accordingly, what is needed in the art is an improved way to preserve and use the prosodic information throughout the process of speech-to-speech translation.