Speech translation systems are coming into practical use. For example, NTT DOCOMO, INC. started the speech translation service “hanashite honyaku” in 2012. This service supports not only a face-to-face speech translation service but also a non-face-to-face speech translation service. In the face-to-face speech translation service, two users share one speech translation terminal, and the conversation between the two users, who face each other, is subjected to speech translation. In the non-face-to-face speech translation service, a dialogue between two users who are remotely located and connected by a call device such as a telephone is subjected to speech translation.
In the face-to-face speech translation service, the speech translation terminal shared by the two users provides an utterance start button and an utterance completion button for each of the two users' languages. When two users who speak different languages (for example, Japanese and English) converse in their respective languages, each user pushes the utterance start button and then speaks in his or her own language. When a first user of the two users completes the utterance, the first user pushes the utterance completion button. Instead of relying on the utterance completion button, the end of the first user's utterance may be determined automatically by detecting a silent interval.
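The automatic completion by silent-interval detection mentioned above can be illustrated with a minimal sketch. It assumes that the input audio has already been split into frames with one energy value per frame. The function name `detect_utterance_end`, the energy threshold, and the required number of silent frames are hypothetical and are not taken from the service described here.

```python
from typing import Optional, Sequence


def detect_utterance_end(frame_energies: Sequence[float],
                         energy_threshold: float = 0.01,
                         min_silent_frames: int = 30) -> Optional[int]:
    """Return the index of the first frame of the silent interval that
    ends the utterance, or None if no such interval is found.

    All parameter values are illustrative assumptions.
    """
    speech_started = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            # Speech frame: the utterance is ongoing, reset the silence count.
            speech_started = True
            silent_run = 0
        elif speech_started:
            # Silent frame after speech has started.
            silent_run += 1
            if silent_run >= min_silent_frames:
                return i - min_silent_frames + 1
        # Silence before any speech is ignored.
    return None
```

In this sketch, silence that precedes the first speech frame does not complete the utterance, which corresponds to the user having pushed the utterance start button but not yet spoken.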
As a result, a speech recognition result and a translation result are displayed as character strings on a screen of the speech translation terminal. Furthermore, the translation result is output as speech via a speaker on the side of the other party (a second user of the two users). Next, the second user, having looked at the screen, speaks after performing the same operations. The translation result of this utterance is output via a speaker on the side of the first user. Thus, by repeating similar operations, the two users can converse via the speech translation terminal.
In the non-face-to-face speech translation service, the first user A performs a sequence of operations such as [pushing the utterance start button]→[uttering]→[pushing the utterance completion button]. In this case, the second user B (the other party) hears, via a telephone, a sequence such as [notification sound “Pi!” caused by user A pushing the utterance start button]→[user A's utterance]→[notification sound “Pi!” caused by user A pushing the utterance completion button]→[speech of translation result]. By mutually repeating this operation, the two users can hold a conversation by speech translation.
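The correspondence between user A's operations and what user B hears can be written as a small sketch. The event names (`push_start`, `utter`, `push_completion`) and the function `heard_by_listener` are hypothetical labels introduced only for this illustration.

```python
from typing import List

BEEP = 'notification sound "Pi!"'


def heard_by_listener(speaker_events: List[str]) -> List[str]:
    """Map the speaker's operations to the sounds the listener hears."""
    heard: List[str] = []
    for event in speaker_events:
        if event in ("push_start", "push_completion"):
            # Each button push is signalled to the listener by a beep.
            heard.append(BEEP)
        elif event == "utter":
            heard.append("original utterance")
        if event == "push_completion":
            # The translated speech is played only after completion.
            heard.append("speech of translation result")
    return heard
```

For the operation sequence `["push_start", "utter", "push_completion"]`, the sketch yields the beep, the original utterance, a second beep, and then the speech of the translation result, in that order.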
In such a speech translation apparatus, the speech translation result is output via a display or a speaker only after one user's utterance is completed. Accordingly, in comparison with communication by regular conversation, it takes a long time for the other user to understand the one user's intention.
In order to solve this problem, a face-to-face simultaneous translation system is proposed in the following references.
(Reference 1) JP Pub. No. 2002-27039
(Reference 2) “Evaluation of a Simultaneous Interpretation System for Continuous-Speech Conversation”, Information Processing Society of Japan (IPSJ) SIG technical reports, 2013-HCI-151 (17), 1-99, 2013-01-25
In the face-to-face simultaneous translation system, translation units are automatically detected from the users' speech while the two users are uttering, and the translation result is displayed so as to follow each user's utterance. In this case, the translation result is presented without waiting for completion of the user's utterance. As a result, the time necessary for one user to understand the other user's intention is reduced, and the users can communicate smoothly.
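The idea of outputting a translation result per translation unit, without waiting for the whole utterance, can be sketched as follows. This is not the unit detection method of the cited references. The boundary rule and the translation function are supplied by the caller, and all names (`incremental_translate`, `is_unit_boundary`, `translate`) are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def incremental_translate(words: Iterable[str],
                          is_unit_boundary: Callable[[str], bool],
                          translate: Callable[[str], str]) -> Iterator[str]:
    """Yield a translation each time a translation unit is completed."""
    unit: List[str] = []
    for word in words:
        unit.append(word)
        if is_unit_boundary(word):
            # A unit is complete: translate it immediately.
            yield translate(" ".join(unit))
            unit = []
    if unit:
        # Translate whatever remains when the utterance ends.
        yield translate(" ".join(unit))
```

Because the function is a generator, each translated unit becomes available as soon as its boundary word arrives, which mirrors a display that follows the user's utterance.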
The simultaneous speech translation system assumes a face-to-face speech translation service, in which no problem occurs even if the translation result is displayed consecutively while the user is uttering. However, when the simultaneous speech translation system is applied to a non-face-to-face speech translation service, if the speech of the consecutively translated result is output so that it overlaps with the original speech of the speaker (user A), it is hard for the listener (user B) to hear the speech of the translation result.
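The overlap between the speaker's original speech and the speech of the translation result can be quantified with a short sketch. It assumes that each speech segment is represented by a (start, end) pair in seconds. The function names and the example timings are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def overlap_seconds(a: Interval, b: Interval) -> float:
    """Length of time during which intervals a and b both play."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def total_overlap(original: Interval, translated: List[Interval]) -> float:
    """Total time the translated speech overlaps the original speech.

    Assumes the translated segments do not overlap one another.
    """
    return sum(overlap_seconds(original, seg) for seg in translated)
```

For example, if the original speech occupies 0.0 to 5.0 seconds and translated segments are played at 2.0 to 3.5 seconds and 4.5 to 6.0 seconds, the listener hears both speeches at once for 2.0 seconds in total.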
In order to solve this problem, the speech of the translation result may be output after the speaker's utterance is completed, so that the listener can easily hear the speech of the translation result. However, in this method, it takes a long time for the listener to understand the speaker's intention. As a result, communication between users A and B cannot be smoothly realized.
Furthermore, a method that avoids overlapping speeches by outputting only the speech of the translation result, and not the speaker's original speech, can be considered. In this method, the listener hears only the speech of the translation result. Accordingly, it is hard for the listener to judge when to speak. For example, when the speech of the translation result pauses, two cases are possible. In the first case, the speaker has completed the utterance and is waiting for the listener to speak. In the second case, the speaker has merely paused and will continue the utterance. Two users (speaker and listener) who are remotely located cannot easily tell these two cases apart, so it is difficult for them to know whose turn it is to speak. As a result, their conversation is not smooth.