In recent years there has been immense growth in the information and services available electronically, as evidenced by the widespread use of the Internet. Typically, a user interacts with such a system by typing information, for example using a keyboard or touch screen, and viewing the results on a video display. However, conversational systems that allow the user to input information verbally are increasingly available, and the system output may be provided to the user aurally. Such conversational systems permit the user to readily obtain information and services while on the move, leaving the user's hands free for other tasks.
Conversational systems require speech recognition to understand the user and speech synthesis to render information in a human-like voice. Typically, such systems execute in a telephony infrastructure where the client devices are telephone instruments such as mobile handsets. Originally, such conversational systems worked with dumb client devices, and consequently all of the speech processing (recognition and synthesis) was performed in servers communicating with the dumb clients. However, the increased processing capability of handheld clients has made both speech recognition and speech synthesis possible at the client side.
In some conversational systems, part of the speech recognition is performed on a client device. The term ‘Distributed Speech Recognition’ refers to systems that allow applications to combine local speech processing on the client device with remote access to network-based speech services. For example, signal processing such as noise reduction may be executed on the client device, which then sends the processed data to a network-based speech service. In turn, the speech service processes the received signal to determine the user's request and responds to the user using a voice output.
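By way of illustration only, the division of work in such a distributed arrangement may be sketched as follows. The sketch assumes a crude noise gate as the client-side signal processing and a placeholder in place of a real network-based recognizer; the names `reduce_noise`, `client_prepare` and `server_handle`, the threshold value and the JSON payload format are hypothetical and are not taken from any particular system.

```python
import json
from typing import List


def reduce_noise(samples: List[float], threshold: float = 0.05) -> List[float]:
    """Client side: crude noise gate that zeroes samples below the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def client_prepare(samples: List[float], threshold: float = 0.05) -> str:
    """Client side: clean the signal locally, then package it for the network."""
    cleaned = reduce_noise(samples, threshold)
    return json.dumps({"samples": cleaned})


def server_handle(payload: str) -> str:
    """Server side: placeholder for the network-based speech service.

    A real service would run speech recognition here; this stub only
    counts the samples that survived the client's noise gate.
    """
    samples = json.loads(payload)["samples"]
    voiced = sum(1 for s in samples if s != 0.0)
    if voiced == 0:
        return "Sorry, I did not hear anything."
    return f"Received {voiced} voiced samples."


if __name__ == "__main__":
    raw = [0.01, -0.02, 0.4, -0.35, 0.03, 0.5]
    print(server_handle(client_prepare(raw)))
```

In this sketch the client performs only inexpensive signal processing, while the step that determines the user's request is left to the server.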
Another known technique that uses the processing capability of the client is Embedded Concatenative Text-To-Speech (eCTTS), in which part of the speech synthesis is performed at the client. Speech segments are stored as compressed feature vectors, from which the speech may be reconstructed.
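The store-compressed, reconstruct, and concatenate pattern may be sketched as follows. This is a simplified illustration rather than an actual eCTTS implementation: uniform 8-bit quantization of raw samples stands in for the compressed feature vectors, and the unit names and the functions `compress`, `reconstruct` and `synthesize` are hypothetical.

```python
from typing import Dict, List


def compress(segment: List[float]) -> bytes:
    """Quantize samples in [-1.0, 1.0] to 8-bit codes (stand-in for feature vectors)."""
    return bytes(round((max(-1.0, min(1.0, s)) + 1.0) * 127.5) for s in segment)


def reconstruct(codes: bytes) -> List[float]:
    """Map 8-bit codes back to approximate sample values."""
    return [c / 127.5 - 1.0 for c in codes]


def synthesize(units: List[str], inventory: Dict[str, bytes]) -> List[float]:
    """Client side: decode each stored segment and concatenate the results."""
    output: List[float] = []
    for unit in units:
        output.extend(reconstruct(inventory[unit]))
    return output


if __name__ == "__main__":
    # Hypothetical inventory of two speech units held on the client.
    inventory = {
        "hel": compress([0.0, 0.25, -0.5]),
        "lo": compress([1.0, -1.0]),
    }
    print(synthesize(["hel", "lo"], inventory))
```

The reconstruction is lossy: each decoded sample differs from the original by at most half a quantization step.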
In another known approach, a conversational system may reside completely on the client, with the entire speech recognition process performed locally. Since clients typically have limited processing capacity, only very small conversational systems can be executed on such devices.
Notwithstanding the existing art, there is an ongoing need for more efficient and versatile systems for processing voice applications.