Automatic speech recognition (“ASR”) systems convert speech into text. As used herein, the term “speech recognition” refers to the process of converting a speech (audio) signal to a sequence of words or a representation thereof (text). Speech recognition applications that have emerged over the last few years include voice dialing (e.g., “Call home”), call routing (e.g., “I would like to make a collect call”), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), and content-based spoken audio searching (e.g., finding a podcast where particular words were spoken).
In converting audio to text, ASR systems may employ models, such as an acoustic model and a language model. The acoustic model may be used to convert speech into a sequence of phonemes most likely spoken by a user. A language model may be used to find the words that most likely correspond to the phonemes. In some applications, the acoustic model and language model may be used together to transcribe speech.
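By way of illustration only, the following minimal Python sketch shows one way an acoustic score and a language model score might be combined to choose a transcription. The candidate utterances, the probability values, and the names `ACOUSTIC_LOGP`, `LM_LOGP`, `transcribe`, and `lm_weight` are hypothetical and are not taken from any particular ASR system.

```python
import math

# Hypothetical acoustic scores: log-probability of the audio given each candidate.
ACOUSTIC_LOGP = {
    "call home": math.log(0.30),
    "call hone": math.log(0.35),
    "tall home": math.log(0.20),
}

# Hypothetical language model scores: log-probability of each word sequence.
LM_LOGP = {
    "call home": math.log(0.050),
    "call hone": math.log(0.001),
    "tall home": math.log(0.002),
}


def transcribe(acoustic, language, lm_weight=1.0):
    """Return the candidate with the highest combined log score."""
    def combined(candidate):
        return acoustic[candidate] + lm_weight * language[candidate]
    return max(acoustic, key=combined)


# Both models together prefer the plausible phrase.
best = transcribe(ACOUSTIC_LOGP, LM_LOGP)

# With the language model ignored, the acoustically closest candidate wins.
acoustic_only = transcribe(ACOUSTIC_LOGP, LM_LOGP, lm_weight=0.0)
```

In this toy example the acoustic scores alone slightly favor “call hone,” while the language model score shifts the result to “call home.”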
An ASR system may employ an ASR engine to recognize speech. Using models such as an acoustic model and a language model, the ASR engine may perform a search among the possible utterances that may be spoken. To reduce the amount of time needed to perform the speech recognition, the ASR engine may limit its search to a subset of all the possible utterances.
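One common way of limiting a search to a subset of hypotheses is beam pruning, in which only the best-scoring partial hypotheses are kept at each step. The Python sketch below is a simplified illustration under that assumption and is not a description of any particular ASR engine. The function `beam_search`, the per-position candidate words, and their scores are hypothetical, and each word is scored independently of its neighbors for brevity.

```python
import math


def beam_search(steps, beam_width):
    """Search word sequences, keeping at most `beam_width` hypotheses per step.

    `steps` is a list of dicts mapping a candidate word to its log score at
    that position. Returns the best (words, score) pair and the total number
    of hypotheses that were expanded.
    """
    beam = [((), 0.0)]  # start with a single empty hypothesis
    expanded = 0
    for candidates in steps:
        extended = [
            (words + (word,), score + word_score)
            for words, score in beam
            for word, word_score in candidates.items()
        ]
        expanded += len(extended)
        # Prune: keep only the highest-scoring hypotheses.
        extended.sort(key=lambda hyp: hyp[1], reverse=True)
        beam = extended[:beam_width]
    return beam[0], expanded


# Hypothetical per-position candidates and probabilities.
STEPS = [
    {"call": math.log(0.6), "tall": math.log(0.3), "fall": math.log(0.1)},
    {"home": math.log(0.7), "hone": math.log(0.2), "foam": math.log(0.1)},
    {"now": math.log(0.5), "no": math.log(0.4), "know": math.log(0.1)},
]

(pruned_words, _), pruned_count = beam_search(STEPS, beam_width=2)
(full_words, _), full_count = beam_search(STEPS, beam_width=1000)
```

With these toy inputs, the pruned search expands 15 hypotheses and the unpruned search expands 39, and both return the same best sequence. In general, a narrower beam reduces the work performed but may discard the hypothesis that would have scored best overall.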
Early ASR systems were limited in that they had small vocabularies, could recognize only discrete words, and were slower and less accurate than later systems. For example, early ASR systems may have recognized only digits or required the user to pause between words. As technology progressed, ASR systems were developed that are described as large-vocabulary continuous speech recognition (“LVCSR”) systems. LVCSR systems provided improvements over the early systems, including larger vocabularies, the ability to recognize continuous speech, faster recognition, and better accuracy. For example, LVCSR systems may allow for applications such as the dictation of documents. As the technology has improved and as computers have become more powerful, LVCSR systems have been able to handle larger and larger vocabularies while increasing accuracy and speed.
As LVCSR systems increase in size, the ASR engine performing the search may have a more complicated task and may require a significant amount of computing resources and time to perform the search. The design of an ASR engine may allow the search to be more computationally efficient and more accurate.