The present invention relates generally to speech recognition, and more particularly to recognition of speech received by mobile devices.
Mobile devices are becoming an indispensable front end to many speech applications. At present, support for speech recognition/understanding applications on a mobile device can take one of the following three forms.
Stand-alone Speech Recognition: All of the speech-to-text software components are hosted on the mobile device. An example of this form is the one used by mobile phones and other mobile communication devices to enable voice dialing and similar functions. Due to the limited computational resources of mobile phones, this form is unable to provide reasonable performance even for the smallest speech utterance under the quietest operational conditions. Expanding the device resources to improve performance makes the device bulkier, more expensive, and less power efficient.
Remote Speech Recognition: To overcome mobile device resource limitations as a barrier to achieving a high level of speech-to-text accuracy, the components of the speech recognizer are hosted on a large server infrastructure that is reachable via a wireless network. The microphone on the mobile device is used to capture the user's spoken utterance. The voice utterance is coded, compressed, and transmitted wirelessly to a server that applies a sequence of signal processing and speech recognition tasks to convert the speech signal into text. This approach is vulnerable to carrier signal quality problems such as signal fading and deterioration of the speech signal caused by transmitter encoding. Furthermore, transmitting the speech signal over the wireless network can be time consuming and costly in air time.
Distributed Speech Recognition (DSR): This form is a hybrid of the two forms above, in which the recognition process is divided into two functional parts: a Front-End (FE) on the device and a Back-End (BE) on a server. The Front-End transforms the digitized speech into a stream of feature vectors. These vectors are then sent to the Back-End via a data transport, which could be wireless or wired. The recognition engine of the Back-End matches the sequence of input feature vectors against references and generates a recognition result. Again, while it is possible to have a local speech recognizer on future mobile devices, at present this would entail a substantial additional cost due to the processing power and memory restrictions of current mobile devices. Some current solutions overcome this issue by placing the computationally intensive and memory intensive parts on a remote server. Although DSR both reduces the bandwidth required and minimizes the distortion arising from transmission errors over the cellular network, distortion is still possible due to data truncation in the digitization and transportation process.
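By way of illustration only, the division of labor between the Front-End and the Back-End described above can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular system: the function names (`front_end`, `back_end`, `dtw_distance`, `tone`) are hypothetical, the two features per frame (log energy and zero-crossing rate) are simple stand-ins for the cepstral features that practical front ends typically compute, and dynamic time warping is one possible way to match a feature sequence against references. The data transport between the two parts is omitted.

```python
import math


def front_end(samples, frame_len=80):
    """Device side: turn digitized speech into a stream of feature vectors.

    Assumption: log energy and zero-crossing rate are illustrative stand-ins
    for the features a real front end would compute.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append((math.log(energy + 1e-9), crossings / (frame_len - 1)))
    return features


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two feature sequences."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]


def back_end(features, references):
    """Server side: return the label of the closest reference sequence."""
    return min(references, key=lambda label: dtw_distance(features, references[label]))


def tone(freq_hz, n=800, rate=8000):
    """Hypothetical test signal: a pure sine tone standing in for speech."""
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]


# Reference feature sequences held by the Back-End (toy data).
references = {"low": front_end(tone(200)), "high": front_end(tone(1500))}

# The device sends only feature vectors, not the raw samples.
result = back_end(front_end(tone(220)), references)
```

In this sketch an 800-sample utterance is reduced to ten two-element feature vectors before it leaves the device, which illustrates why the distributed form needs less bandwidth than transmitting the coded speech signal itself.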