1. Technical Field
The present invention relates to translating speech from one language to another, and more particularly, to a system, apparatus and method for simulating user actions and responses for rapid automatic user training during the use of a speech-to-speech translation system.
2. Description of the Related Art
Modern speech-to-speech (S2S) translation systems attempt to enable communication between two people who do not share a language. To recognize speech in one language and render it as speech in another, advanced technologies such as automatic speech recognition, machine translation, text-to-speech synthesis and natural language processing are integrated within a user interface that facilitates multilingual communication between the two speakers. The resulting system and its user functions are usually so complicated that the system is very difficult to use. It is even more difficult for beginner users to master the operational functions properly without sufficient training.
Modern speech-to-speech translation systems aim to facilitate communication between people speaking different languages. To achieve this goal, a typical speech translation system (1) collects the speech signal from one speaker, (2) recognizes the speech in the source language, (3) translates the recognized message into the target language, (4) synthesizes the speech sound of the translated sentence, and (5) plays it through a loudspeaker. Steps (2), (3) and (4) are commonly realized by the techniques of automatic speech recognition (ASR), machine translation (MT) and text-to-speech synthesis (TTS), respectively.
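By way of illustration only, steps (2) through (4) above can be sketched as three chained stages. The sketch below is written in Python under stated assumptions: the names S2SPipeline, toy_asr, toy_mt, toy_tts and TOY_LEXICON are hypothetical, the three stages are trivial stand-ins rather than real ASR, MT or TTS engines, audio is represented as raw bytes, and steps (1) and (5), signal capture and playback, are omitted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class S2SPipeline:
    """Chains the three stages: ASR -> MT -> TTS (hypothetical structure)."""
    recognize: Callable[[bytes], str]    # step (2): source audio -> source text
    translate: Callable[[str], str]      # step (3): source text -> target text
    synthesize: Callable[[str], bytes]   # step (4): target text -> target audio

    def run(self, audio: bytes) -> bytes:
        source_text = self.recognize(audio)
        target_text = self.translate(source_text)
        return self.synthesize(target_text)


# Toy stand-ins (assumptions for illustration, not real engines).
def toy_asr(audio: bytes) -> str:
    # Pretends the audio bytes are already UTF-8 text.
    return audio.decode("utf-8")


TOY_LEXICON = {"hello": "ni hao", "thank you": "xie xie"}


def toy_mt(text: str) -> str:
    # Whole-phrase lookup; unknown phrases pass through unchanged.
    return TOY_LEXICON.get(text.lower(), text)


def toy_tts(text: str) -> bytes:
    # Pretends the encoded text is the synthesized waveform.
    return text.encode("utf-8")


pipeline = S2SPipeline(recognize=toy_asr, translate=toy_mt, synthesize=toy_tts)
output_audio = pipeline.run(b"Hello")
```

Keeping each stage behind a plain callable means any one of them can be replaced, for example by a simulated component, without changing the others.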
One issue for the success of a speech-to-speech translation system is whether a new user can be trained to operate the system properly, including all of its ASR, MT and TTS functions, and how quickly that training can be completed. Current speech-to-speech translation systems often provide two types of user support: material-based learning and human-aided learning.
With the first type, a new user is provided, at almost no cost, with a set of text-based user manuals and/or well-designed video materials in the hope that he/she can figure out how to use the system properly. Since the training procedure is not interactive and no foreign speaker is available, the resulting learning cycle is usually lengthy, ineffective and frustrating. In the second type of user support, a new user is given lessons by multilingual instructors and has the opportunity to practice using the system with a bilingual speaker. The resulting learning cycle is usually significantly shorter and much more effective than with the first type of user support, but comes at a dramatically higher cost. More importantly, it is often very difficult for a new user to find a bilingual speaker with whom to practice using the system outside of training classes.
Current user training methods rely mostly on text-based user manuals or pre-recorded video materials, which are often neither easy to understand nor interactive. Moreover, user training for a multilingual system requires multilingual speakers. To practice using these systems, two users are required: one who speaks the native language, such as English, and another who speaks a foreign language, such as Chinese.
In reality, foreign speakers are difficult to find and expensive to engage when a user tries to learn a new speech-to-speech (S2S) translation system. Alternatively, a monolingual user can practice with the system using only his/her native language, but this greatly increases system learning time and significantly limits the functions that can be practiced. The result is reduced user satisfaction.