Against the background of recent significant progress in speech recognition, machine translation, and speech synthesis, speech translation systems that combine these techniques have been put into practical use. In such a system, speech input in a first language is converted into text in the first language by speech recognition. The text in the first language is then translated into text in a second language by machine translation, and the translated text is converted into speech in the second language by a speech synthesis module corresponding to the second language. The practical application of this technique is expected to eliminate the language barrier, allowing people to communicate freely with speakers of other languages.
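The three-stage cascade described above can be sketched as follows. This is a minimal illustration only: the class name `SpeechTranslator`, its field names, and the toy stand-in functions are hypothetical and do not correspond to any particular recognition, translation, or synthesis engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechTranslator:
    """Cascade of three stages; each stage is a caller-supplied function."""
    recognize: Callable[[bytes], str]    # first-language speech -> first-language text
    translate: Callable[[str], str]      # first-language text -> second-language text
    synthesize: Callable[[str], bytes]   # second-language text -> second-language speech

    def run(self, speech: bytes) -> bytes:
        source_text = self.recognize(speech)
        target_text = self.translate(source_text)
        return self.synthesize(target_text)


# Toy stand-ins (assumptions): "speech" is just UTF-8 bytes of the text,
# and translation is a one-entry lookup table.
toy = SpeechTranslator(
    recognize=lambda audio: audio.decode("utf-8"),
    translate=lambda text: {"Hello.": "Konnichiwa."}.get(text, text),
    synthesize=lambda text: text.encode("utf-8"),
)
result = toy.run(b"Hello.")
```

Note that nothing in this cascade carries timing information from the input speech to the output speech, so the output is free to place each word at a different time than the input did.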
At the same time, in addition to auditory information received through the ears, visual information received through the eyes, such as facial expressions and gestures, contributes greatly to the transmission of meaning. For example, a gesture such as “pointing” greatly helps the listener understand what is meant. For this reason, the motion of the speaker is transmitted to the listener through an image or a robot to achieve more natural communication. For example, Patent Literature 1 proposes reproducing the motion of the speaker through a robot.
However, when a speech translation system translates first language speech into second language speech, it is difficult to guarantee that a word with the same meaning always occurs at the same time (the time relative to the beginning of the speech). As a result, a mismatch (hereinafter referred to as “time lag”) occurs between the visual information received through the listener's eyes and the auditory information received through the listener's ears, which may significantly impair understanding of the meaning.
A conventional method (Patent Literature 2) adjusts the start time and end time of the second language speech to those of the first language speech, which makes it possible to synchronize the speech and the image at the start and end times. However, the problem of local time lag between visual information and auditory information within the utterance remains unsolved. In particular, in translation between Japanese and English, the time lag caused by the difference in word order is significant, and this may lead to misinterpretation.
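The remaining local time lag can be illustrated with a small sketch. It assumes, purely for illustration, that the start and end adjustment is done by uniformly rescaling the word timings of the second language speech; Patent Literature 2 may perform the adjustment differently. The function name `fit_to_interval`, the labels, and all timings are hypothetical.

```python
def fit_to_interval(words, start, end):
    """Linearly rescale (label, start, end) word timings onto [start, end]."""
    raw_start = words[0][1]
    raw_end = words[-1][2]
    scale = (end - start) / (raw_end - raw_start)
    return [
        (label, start + (s - raw_start) * scale, start + (e - raw_start) * scale)
        for label, s, e in words
    ]


# Hypothetical timings in seconds. "A" and "B" are two content words whose
# order is reversed by translation; the other labels are filler words.
source = [("x1", 0.0, 0.25), ("A", 0.25, 0.75), ("x2", 0.75, 1.0), ("B", 1.0, 1.5)]
target_raw = [("B", 0.0, 1.0), ("A", 1.0, 2.0), ("y1", 2.0, 3.0)]

# Align the overall start and end of the target speech with the source speech.
fitted = fit_to_interval(target_raw, source[0][1], source[-1][2])

# Local time lag per shared word: target start time minus source start time.
source_start = {label: s for label, s, _ in source}
local_lag = {
    label: s - source_start[label]
    for label, s, _ in fitted
    if label in source_start
}
```

With these numbers the fitted target speech starts at 0.0 s and ends at 1.5 s, exactly like the source, yet word "B" is spoken 1.0 s earlier and word "A" 0.25 s later than in the source. Aligning only the endpoints therefore leaves the word-level lag in place.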
For example, assume that a speaker says “Put this can into this can.” in a first language (English), pointing at can 1 (the first spoken can) as gesture 1 and then pointing at can 2 (the second spoken can) as gesture 2. Here, the temporal correspondence between the spoken reference to can 1 and gesture 1, and between the spoken reference to can 2 and gesture 2, contributes significantly to the listener's understanding of the meaning. However, when the sentence is translated into “Kono kan ni, kono kan wo irete kudasai.” for a native speaker of a second language (Japanese), the order of can 1 and can 2 is reversed relative to the first language, so that the spoken reference to can 2 coincides with gesture 1 and the spoken reference to can 1 coincides with gesture 2. As a result, the meaning is reversed from what the speaker intended.
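The can example can be traced with a short sketch that looks up which can is being named at the moment of each gesture. All timings, the `Word` type, and the helper `referent_at` are assumptions made for illustration; only the two sentences and the order of the cans come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    referent: str   # object the phrase refers to, or "" if none
    start: float    # seconds from the beginning of the speech
    end: float


def referent_at(words, t):
    """Return the referent of the phrase being spoken at time t, or None."""
    for w in words:
        if w.start <= t < w.end:
            return w.referent or None
    return None


# Hypothetical timings for the first language (English) speech.
english = [
    Word("Put", "", 0.0, 0.3),
    Word("this can", "can 1", 0.3, 0.9),
    Word("into", "", 0.9, 1.1),
    Word("this can", "can 2", 1.1, 1.8),
]

# Hypothetical timings for the second language (Japanese) speech,
# with the same overall start and end times.
japanese = [
    Word("kono kan ni", "can 2", 0.0, 0.7),
    Word("kono kan wo", "can 1", 0.7, 1.4),
    Word("irete kudasai", "", 1.4, 1.8),
]

# The gestures keep the timing of the original speaker's motion.
gesture_times = [0.5, 1.3]            # gesture 1, gesture 2
gesture_targets = ["can 1", "can 2"]  # what each gesture points at

heard_en = [referent_at(english, t) for t in gesture_times]
heard_ja = [referent_at(japanese, t) for t in gesture_times]
```

Under these assumed timings, `heard_en` matches `gesture_targets`, while `heard_ja` pairs gesture 1 with can 2 and gesture 2 with can 1, which is the reversal described above.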