1. Field of the Invention
The present invention is related to Automatic Speech Recognition (ASR) and more particularly to improving automatic speech recognition accuracy on networked computer systems such as in a cloud based system.
2. Background Description
A typical state of the art automatic speech recognition (ASR) system uses an acoustic model to map utterances to specific phones or words, and a language model to determine the likelihood of the mapped words in a string of words. Using this information, the ASR system searches for the pattern (word or sentence) with the highest probability of matching each utterance. An adaptation component takes the recognized utterance, the word sequence and the features extracted from the speech signal, and uses them to update the acoustic model.
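The search described above can be illustrated with a minimal sketch, which is not taken from any particular ASR system. It assumes that each candidate word sequence already has an acoustic log-likelihood and that the language model is a simple bigram table. All names (`decode`, `language_model_log_prob`, `lm_weight`) and all scores are hypothetical, chosen only for illustration.

```python
import math

# Assumed floor for word pairs missing from the toy bigram table.
UNSEEN_LOG_PROB = math.log(1e-6)

# Hypothetical acoustic scores: log P(audio | word sequence).
ACOUSTIC_LOG_LIKELIHOODS = {
    "recognize speech": -12.0,
    "wreck a nice beach": -11.5,
}

# Hypothetical bigram language model: log P(word | previous word).
BIGRAM_LOG_PROBS = {
    ("<s>", "recognize"): math.log(0.01),
    ("recognize", "speech"): math.log(0.2),
    ("<s>", "wreck"): math.log(0.001),
    ("wreck", "a"): math.log(0.1),
    ("a", "nice"): math.log(0.05),
    ("nice", "beach"): math.log(0.02),
}


def language_model_log_prob(words, bigram_log_probs):
    """Sum bigram log-probabilities over a word sequence."""
    total = 0.0
    previous = "<s>"  # sentence-start marker
    for word in words:
        total += bigram_log_probs.get((previous, word), UNSEEN_LOG_PROB)
        previous = word
    return total


def decode(acoustic_log_likelihoods, bigram_log_probs, lm_weight=1.0):
    """Return the hypothesis with the highest combined score."""
    best_hypothesis = None
    best_score = -math.inf
    for hypothesis, acoustic_score in acoustic_log_likelihoods.items():
        lm_score = language_model_log_prob(hypothesis.split(), bigram_log_probs)
        score = acoustic_score + lm_weight * lm_score
        if score > best_score:
            best_hypothesis, best_score = hypothesis, score
    return best_hypothesis, best_score


if __name__ == "__main__":
    hypothesis, score = decode(ACOUSTIC_LOG_LIKELIHOODS, BIGRAM_LOG_PROBS)
    print(hypothesis, round(score, 2))
```

In this toy example the acoustically stronger candidate loses once the language model score is added, which is the kind of correction the language model is meant to supply. A practical system searches a far larger hypothesis space than the two candidates enumerated here.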
While the recognition accuracy of state of the art speech recognition systems is generally relatively high, prior speech recognition systems still make frequent recognition errors. To reduce errors and improve accuracy, single user systems train with the user, using acoustics, language, and/or grammar based training or external commands/input to hone the acoustic and language models. This training may be further supplemented with local context information, such as, for example, information gleaned from dialect detection and background noise removal.
While a single user system may improve recognition with additional training, there are many existing or potential multi-user ASR applications in more general situations. Existing ASR systems fall short of the precision required for these situations, which include, for example, call centers, emergency coordination centers and closed captioning for live television broadcasts. Extensive training may be unfeasible for these multi-user applications. Other approaches to improving ASR accuracy for multi-user use have relied either on further speech signal processing or on modifying acoustic/language model parameters to accommodate a particular speaker or particular environmental conditions. Regardless of whether additional training, speech signal processing and/or acoustic/language model parameter modification is used to improve accuracy, however, these state of the art approaches have required substantial resources, e.g., time, processing power and storage space.
Moreover, the demand for automatic speech recognition reaches beyond the typical desktop computer to lightweight, handheld mobile devices. Mobile computing devices, which can be characterized as lightweight and as having a small form factor or footprint, have become ubiquitous. Mobile phones, tablets and smart phones, for example, are typical mobile computing devices. With each generation, these mobile devices have become more and more user friendly, providing users with more and more features and functions. However, making these devices user friendly has meant constraining the device architecture, reducing processing power (i.e., using power efficient, low power processors) and limiting the power supplied, in order to fit the small footprint and provide hours of service on a compact, lightweight power cell or battery.
Automatic speech recognition would have natural application to smart phones, for example, both to facilitate users interfacing with the device and to expand its capability. Providing more sophisticated capability, such as automatic speech recognition, in a mobile device is at odds, however, with the size and power constraints of a low-power, lightweight device and with fitting the device in a small footprint. A typical state of the art mobile device may be capable of very rudimentary speech recognition, such as name or number recognition for voice dialing or menu navigation. However, these mobile devices do not have sufficient capacity for more robust speech recognition. So, while the added flexibility of mobile devices may open the door to broader application of ASR, state of the art mobile devices have not been capable of taking advantage of those broader applications.
Recently, when local computing capability has fallen short, excess load has been off-loaded to a shared Information Technology (IT) Infrastructure, typically called the cloud, to meet the shortfall. Cloud service providers (e.g., providing capability on mainframe computers, servers or other cloud computers) share resources and services, and handle data-intensive computing tasks that might otherwise exceed the local capacity or capability, or that might run intolerably slowly on local facilities. Because mobile devices and other low end computing devices, in particular, may be inadequate for automatic speech recognition, much of the ASR load has been off-loaded to the cloud, e.g., for processing a captured speech stream or a representation thereof to generate text.
However, even with extensive training, these state of the art approaches to improving recognition (training, local context and cloud computing) have still fallen short of reaching a level of precision suitable for natural speech communication. Moreover, extensive training is unfeasible in scenarios such as, for example, call centers, live TV broadcasts, and emergency coordination centers. Especially in those scenarios, these state of the art approaches have been inadequate.
Thus, there is a need for enhanced automatic speech recognition, in particular for mobile devices and for sustained voice to machine interaction; and especially for improved recognition where extensive training is unfeasible.