Automatic speech recognition (“ASR”) systems convert speech into text. As used herein, the term “speech recognition” refers to the process of converting a speech (audio) signal to a sequence of words or a representation thereof (text). Speech recognition applications that have emerged over the last few years include voice dialing (e.g., “Call home”), call routing (e.g., “I would like to make a collect call”), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), and content-based spoken audio searching (e.g., finding a podcast where particular words were spoken).
In converting audio to text, ASR systems may employ models, such as an acoustic model and a language model. The acoustic model may be used to convert speech into a sequence of phonemes most likely spoken by a user. A language model may be used to find the words that most likely correspond to the phonemes. In some applications, the acoustic model and language model may be used together to transcribe speech.
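By way of illustration only, the following minimal sketch shows one way in which an acoustic model score and a language model score may be combined to select a transcription. The candidate phrases, the probability values, and the names `ACOUSTIC_PROB`, `LANGUAGE_PROB`, `combined_log_score`, `best_transcription`, and `lm_weight` are hypothetical and are not taken from any particular ASR system. The sketch assumes that the scores are combined by adding log probabilities, which is a common approach but not the only one.

```python
import math

# Hypothetical toy scores, invented for illustration only.
# Acoustic model: how well each candidate matches the audio, P(audio | words).
ACOUSTIC_PROB = {
    "recognize speech": 0.30,
    "wreck a nice beach": 0.35,
}
# Language model: how plausible each candidate is as text, P(words).
LANGUAGE_PROB = {
    "recognize speech": 0.020,
    "wreck a nice beach": 0.001,
}


def combined_log_score(candidate, lm_weight=1.0):
    """Return log P(audio | words) + lm_weight * log P(words)."""
    acoustic = math.log(ACOUSTIC_PROB[candidate])
    language = math.log(LANGUAGE_PROB[candidate])
    return acoustic + lm_weight * language


def best_transcription(candidates, lm_weight=1.0):
    """Pick the candidate with the highest combined log score."""
    return max(candidates, key=lambda c: combined_log_score(c, lm_weight))


if __name__ == "__main__":
    candidates = list(ACOUSTIC_PROB)
    # Both models together favor the more plausible phrase.
    print(best_transcription(candidates))
    # With the language model ignored, the acoustically closer phrase wins.
    print(best_transcription(candidates, lm_weight=0.0))
```

In this sketch, the acoustic scores alone slightly favor the less plausible phrase, and the language model score changes the outcome, which is one reason the two models may be used together.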
An ASR system may employ an ASR engine to recognize speech. Using models, such as an acoustic model and a language model, the ASR engine may perform a search among the possible utterances that may be spoken. In performing the search, the ASR engine may limit its search to some subset of all the possible utterances that may be spoken to reduce the amount of time and computational resources needed to perform the speech recognition.
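One way in which a search may be limited to a subset of the possible utterances is a beam search, in which only a fixed number of the best partial hypotheses are kept at each step. The following is a minimal sketch under that assumption; the function name `beam_search`, the parameter `beam_width`, and the toy word scores in `STEPS` are hypothetical. In this toy example the per-step scores are independent of one another, so pruning cannot discard the best answer; in general, a pruned search accepts some risk of missing the best utterance in exchange for less work.

```python
import math


def beam_search(step_log_probs, beam_width):
    """Search over word sequences, keeping only the best partial hypotheses.

    step_log_probs: list of dicts mapping word -> log probability per step.
    beam_width: maximum number of hypotheses kept after each step.
    Returns (best_sequence, best_score, hypotheses_expanded).
    """
    beam = [((), 0.0)]  # each entry is (word sequence so far, total log score)
    expanded = 0
    for options in step_log_probs:
        candidates = []
        for seq, score in beam:
            for word, log_prob in options.items():
                candidates.append((seq + (word,), score + log_prob))
                expanded += 1
        # Prune: keep only the highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
    best_seq, best_score = beam[0]
    return best_seq, best_score, expanded


# Hypothetical per-step word scores, invented for illustration only.
STEPS = [
    {"call": math.log(0.6), "tall": math.log(0.3), "fall": math.log(0.1)},
    {"home": math.log(0.7), "hone": math.log(0.2), "foam": math.log(0.1)},
    {"now": math.log(0.5), "no": math.log(0.3), "know": math.log(0.2)},
]

if __name__ == "__main__":
    # A beam wide enough to hold every hypothesis is an exhaustive search.
    print(beam_search(STEPS, beam_width=27))
    # A narrow beam expands far fewer hypotheses.
    print(beam_search(STEPS, beam_width=2))
```

With three steps of three words each, the exhaustive setting expands 39 hypotheses and the narrow beam expands 15, which illustrates how limiting the search may reduce the time and computational resources required.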
Early ASR systems were limited in that they had small vocabularies, could recognize only discrete words, were slow, and were relatively inaccurate. For example, early ASR systems may have recognized only digits or may have required the user to pause between words. As technology progressed, ASR systems were developed that are described as large-vocabulary continuous speech recognition (“LVCSR”) systems. LVCSR systems provided improvements over the early systems, including larger vocabularies, the ability to recognize continuous speech, faster recognition, and better accuracy. For example, LVCSR systems may allow for applications such as the dictation of documents. As the technology has improved and as computers have become more powerful, LVCSR systems have been able to handle increasingly large vocabularies with greater accuracy and speed.
As LVCSR systems increase in size, the ASR engine performing the search may have a more complicated task and may require a significant amount of computing resources and time to perform the search. Certain devices, such as smartphones, tablet computers, and others, may lack the computing resources necessary to perform LVCSR or other types of ASR. It may also be inconvenient or inefficient to provide the hardware or software necessary for ASR on certain devices. As an alternative to implementing ASR directly on unsuitable devices, an ASR engine may be hosted on a server computer that is accessible via a network. Various client devices may transmit audio data over the network to the server, which may recognize any speech therein and transmit corresponding text back to the client devices. This arrangement may enable ASR functionality to be provided on otherwise unsuitable devices despite their limitations.
As mentioned above, ASR may be computationally intensive. However, the amount of computational resources required for ASR may also vary substantially depending on the content of the speech that is being recognized. For example, certain words or phrases may require substantially more resources to recognize than others. Accordingly, a server with sufficient resources to perform ASR on the most complex parts of an audio stream will be underutilized when working on less complex parts. To avoid underutilization, particularly in hosted environments of the sort described above, a single server may be used to perform ASR on several audio streams at once. However, when one or more of the streams becomes too complex, the server may become overloaded and unable to keep up. Overloading may be avoided by adding servers and configuring each one to handle fewer audio streams, but then many of the servers will once again be underutilized.
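The tradeoff between underutilization and overloading may be illustrated with simple arithmetic. The following minimal sketch assumes that the work required by each audio stream can be expressed as a cost per time interval in arbitrary units; the function `utilization_report`, the constants `EASY` and `HARD`, the example streams, and the capacity values are hypothetical.

```python
def utilization_report(streams, capacity):
    """Summarize the load on one server handling several audio streams.

    streams: list of equal-length lists giving each stream's cost per interval.
    capacity: amount of work the server can perform per interval.
    Returns (loads, overloaded_intervals, mean_utilization).
    """
    n_intervals = len(streams[0])
    # Total work demanded from the server in each interval.
    loads = [sum(stream[t] for stream in streams) for t in range(n_intervals)]
    # Intervals in which demand exceeds what the server can do.
    overloaded = [t for t, load in enumerate(loads) if load > capacity]
    # Fraction of total capacity actually used (demand is capped at capacity).
    used = sum(min(load, capacity) for load in loads)
    mean_utilization = used / (capacity * n_intervals)
    return loads, overloaded, mean_utilization


# Hypothetical costs: easy speech needs 2 units, hard speech needs 8 units.
EASY, HARD = 2, 8
stream_a = [EASY, EASY, HARD, EASY]
stream_b = [EASY, HARD, HARD, EASY]

if __name__ == "__main__":
    # One stream on a server sized for the hardest interval: mostly idle.
    print(utilization_report([stream_a], capacity=8))
    # Two streams sharing a slightly larger server: busier, but overloaded
    # in the interval where both streams are hard at the same time.
    print(utilization_report([stream_a, stream_b], capacity=10))
```

In the first case the server is never overloaded but uses less than half of its capacity on average. In the second case average utilization is higher, but the server falls behind in the interval in which both streams are complex.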
Additional background information on ASR systems may be found in L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993; F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997; X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing, Prentice Hall, 2001; D. Jurafsky and J. Martin, Speech and Language Processing (2nd edition), Prentice Hall, 2008; and M. Gales and S. Young, The Application of Hidden Markov Models in Speech Recognition, Now Publishers Inc., 2008; which are incorporated herein by reference in their entirety.