Recently, with the globalization of culture and the economy, speech translation apparatuses that support communication between persons having different native languages are in high demand. For example, speech translation application software that runs on smartphones has been commercialized. Furthermore, services that provide a speech translation function are in use.
In such application software and services, when a user utters speech in a first language in a short unit (one sentence or several sentences) toward the speech translation apparatus, the speech is converted into a corresponding character string by a speech recognition function. This character string of the first language (source language) is then translated into a character string of a second language (target language). Lastly, the character string obtained as the translation result is read aloud in the second language by a speech synthesis function. Here, the user of the first language (source language) is required to utter in short units. On the other hand, the user of the second language (target language) is required to confirm the translation result in each short unit and to listen to the synthesized speech. Accordingly, in conversation using such application software, wait time occurs frequently. As a result, a highly responsive conversation is difficult to carry out.
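The three-stage flow described above (speech recognition, machine translation, speech synthesis) can be sketched as follows. This is only a minimal illustration: the function names, the toy dictionary, and the use of UTF-8 bytes as a stand-in for audio are all assumptions made for the example and do not correspond to any particular product or engine.

```python
# Hypothetical stand-ins for the three functions of the apparatus.
# "Audio" is modeled as UTF-8 bytes purely for illustration.

TOY_DICTIONARY = {"hello": "konnichiwa", "thank you": "arigatou"}


def recognize(speech: bytes) -> str:
    """Speech recognition: first-language speech -> first-language text."""
    return speech.decode("utf-8")


def translate(source_text: str) -> str:
    """Machine translation: source-language text -> target-language text.

    Unknown input is passed through unchanged in this toy version.
    """
    return TOY_DICTIONARY.get(source_text.lower(), source_text)


def synthesize(target_text: str) -> bytes:
    """Speech synthesis: target-language text -> target-language speech."""
    return target_text.encode("utf-8")


def speech_translate(speech: bytes) -> bytes:
    """Run the three stages strictly one after another on one short unit."""
    source_text = recognize(speech)
    target_text = translate(source_text)
    return synthesize(target_text)
```

Because each stage consumes the complete output of the previous one, nothing reaches the listener until the whole short unit has passed through all three stages.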
Furthermore, users wish to communicate the content of a conversation to the other party without being restricted to uttering one sentence at a time. However, such a function has not yet been provided.
Furthermore, speech recognition and speech synthesis process physical speech signals (such as speech input and speech output). Accordingly, the physical duration of the speech constrains the processing time. This constraint is regarded as one reason for the reduced responsiveness of interaction in conversation via the speech translation apparatus.
FIG. 14 shows the time relationship, in a conventional speech translation apparatus, between the user's utterance into the apparatus and the speech output of the translation result therefrom, which begins only after the speech input is completed.
In FIG. 14, the horizontal axis represents the passage of time. While a user A is uttering in the first language (t0˜t1), the speech is captured (S900). Once the utterance is completed, the speech recognition result is fixed and outputted (S910). This speech recognition result is inputted and translated into the second language, which a user B can understand (S920). The machine translation result is inputted and synthesized as speech of the second language (S930). At the timing (t2) when the speech synthesis result is obtained, output of the synthesized speech to the user B starts, and the machine-translated speech is outputted (S940). Accordingly, while the user A is uttering (t0˜t1), no speech is outputted to the user B from the speech translation apparatus. Only at time t2 can the user B hear the translation result for the first time.
On the other hand, while the speech is being outputted to the user B (t2˜t3), no speech is outputted to the user A from the speech translation apparatus. This operation hinders conversation between users who are located at places remote from each other and cannot directly hear each other's speech. For example, when the user B utters while the user A is uttering, or when the user A utters while the speech is being outputted to the user B, their speeches collide.
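The timing relationship of FIG. 14 can be made concrete with a small calculation. The sketch below assumes that t2 equals t1 plus the processing times of S910, S920, and S930, and that t3 equals t2 plus the playback duration. All durations are illustrative values chosen for the example; none of them comes from the figure.

```python
from dataclasses import dataclass


@dataclass
class StageDurations:
    """Processing time of each stage in seconds (illustrative assumptions)."""
    recognition: float  # S910: fixing the recognition result after t1
    translation: float  # S920: machine translation
    synthesis: float    # S930: speech synthesis


def sequential_timeline(t0, utterance, playback, stages):
    """Return (t1, t2, t3) for the strictly sequential flow."""
    t1 = t0 + utterance  # user A finishes uttering
    t2 = t1 + stages.recognition + stages.translation + stages.synthesis
    t3 = t2 + playback   # synthesized speech finishes playing
    return t1, t2, t3


stages = StageDurations(recognition=0.5, translation=0.5, synthesis=1.0)
t1, t2, t3 = sequential_timeline(t0=0.0, utterance=4.0, playback=4.0, stages=stages)
print(f"user B hears nothing before t2 = {t2} s")
print(f"user A hears nothing from t2 = {t2} s to t3 = {t3} s")
```

Under these assumed values, the user B waits 6 seconds from the start of the utterance before hearing anything, and neither user receives output during the other's interval, which is the window in which the colliding utterances described above can occur.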
Meanwhile, in order to confirm whether the machine translation has been performed correctly, JPA (Kokai) PH04-319769 proposes a speech translation system that back-translates the machine translation result. In this reference, after the user A's speech input (in the first language) is recognized, the recognition result is translated into the second language by a machine translation function. The machine translation result is then back-translated into the first language, and the user A confirms whether the back translation result is correct. After this confirmation, synthesized speech of the machine translation result is outputted to the user B. However, in this reference, the steps (speech input, machine translation, back translation, speech synthesis) are executed one after another in order. Accordingly, a wait time occurs at each step. As a result, a spoken dialog cannot proceed smoothly between the users.
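The confirmation flow of the cited reference, as summarized above, can be sketched as follows. This is a hypothetical illustration, not the implementation of the reference: the toy dictionaries, the function names, and the modeling of the user A's confirmation as a callback are all assumptions.

```python
from typing import Callable, Optional

# Toy forward and backward dictionaries (assumptions for illustration).
FORWARD = {"hello": "konnichiwa"}
BACKWARD = {"konnichiwa": "hello", "sayonara": "goodbye"}


def translate(source_text: str) -> str:
    """Machine translation into the second language (toy lookup)."""
    return FORWARD.get(source_text, "sayonara")  # deliberately wrong fallback


def back_translate(target_text: str) -> str:
    """Back translation into the first language (toy lookup)."""
    return BACKWARD.get(target_text, target_text)


def confirm_then_output(
    recognized_text: str,
    user_confirms: Callable[[str, str], bool],
) -> Optional[bytes]:
    """Run the steps in order; output speech only after user A confirms."""
    target_text = translate(recognized_text)
    back_text = back_translate(target_text)
    # User A must inspect the back translation before anything is outputted.
    if not user_confirms(recognized_text, back_text):
        return None  # nothing is outputted to user B
    return target_text.encode("utf-8")  # stand-in for synthesized speech


def same_text(original: str, back: str) -> bool:
    """Stand-in for user A's judgment: accept only an exact round trip."""
    return original == back
```

Each step blocks on the one before it, and the output to the user B additionally blocks on the user A's confirmation, which illustrates where the wait times accumulate.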