Automatic speech recognition (“ASR”) systems convert speech into text. As used herein, the term “speech recognition” refers not only to the process of converting a speech (audio) signal into a sequence of words or a representation thereof (text), but also to the use of natural language understanding (NLU) processes to interpret and understand a user utterance. Speech recognition applications that have emerged in recent years include voice dialing (e.g., “Call home”), call routing (e.g., “I would like to make a collect call”), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), and content-based spoken audio search (e.g., finding a podcast in which particular words were spoken).
An ASR system may employ an ASR engine to recognize speech. The ASR engine may perform a search among the possible utterances that may be spoken by using models, such as an acoustic model and a language model. In performing the search, the ASR engine may limit its search to some subset of all the possible utterances that may be spoken to reduce the amount of time and computational resources needed to perform the speech recognition.
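The search described above can be illustrated with a toy sketch. The hypotheses and scores below are purely hypothetical (a real ASR engine derives log-probabilities from trained acoustic and language models over vastly larger search spaces), but the sketch shows the two-model scoring and the pruning of the search to a small subset (a "beam") of candidates:

```python
# Hypothetical, illustrative log-probability scores only; a real engine
# computes these from trained acoustic and language models.
acoustic_score = {"call home": -2.1, "call hone": -2.0, "cold home": -5.7}
language_score = {"call home": -1.2, "call hone": -9.8, "cold home": -13.1}

def best_hypothesis(hypotheses, beam_width=2):
    """Combine acoustic and language model scores, prune to a beam,
    and return the highest-scoring surviving hypothesis."""
    scored = {h: acoustic_score[h] + language_score[h] for h in hypotheses}
    # Limit the search to the top-scoring subset to save time and computation.
    beam = sorted(scored, key=scored.get, reverse=True)[:beam_width]
    return max(beam, key=scored.get)

print(best_hypothesis(["call home", "call hone", "cold home"]))  # -> call home
```

Note how "call hone" is acoustically the best match but is heavily penalized by the language model, so the combined search prefers "call home"; this interplay between the two models is what the engine's search exploits.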
Early ASR systems were limited in that they had small vocabularies, could recognize only discrete words, were slow, and were less accurate. For example, an early ASR system might recognize only digits or might require the user to pause between words. As technology progressed, ASR systems were developed that are described as large-vocabulary, continuous speech recognition (LVCSR) systems. LVCSR systems provided improvements over the early systems, including larger vocabularies, the ability to recognize continuous speech, faster recognition, and better accuracy. For example, LVCSR systems may allow for applications such as the dictation of documents. As the technology has improved and computers have become more powerful, LVCSR systems have been able to handle ever larger vocabularies with increasing accuracy and speed.
As LVCSR systems increase in size, the ASR engine performing the search may have a more complicated task and may require a significant amount of computing resources and time to perform the search. Certain devices, such as smartphones, tablet computers, and others, may lack the computing resources necessary to perform LVCSR or other types of ASR. It may also be inconvenient or inefficient to provide the hardware or software necessary for ASR on certain devices. As an alternative to implementing ASR directly on unsuitable devices, an ASR engine may be hosted on a server computer that is accessible via a network. Various client devices may transmit audio data over the network to the server, which may recognize any speech therein and transmit corresponding text back to the client devices. This arrangement may enable ASR functionality to be provided on otherwise unsuitable devices despite their limitations.
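The client/server split described above can be sketched as follows. This is a minimal illustration under stated assumptions: `server_recognize` stands in for a server-hosted ASR engine, the network transmission is simulated by a direct call, and the recognition result is hard-coded; a real system would stream audio over a network protocol to an actual recognition service.

```python
def server_recognize(audio_bytes: bytes) -> str:
    """Stand-in for the server-hosted ASR engine, which has the computing
    resources and models that the client device lacks."""
    return "call home"  # hypothetical recognition result

class ThinClient:
    """A resource-limited device (e.g., a smartphone) that offloads ASR
    to a server rather than running it locally."""
    def dictate(self, audio_bytes: bytes) -> str:
        # Transmit the audio data over the network to the server and
        # receive the corresponding text back (simulated by a direct call).
        return server_recognize(audio_bytes)

print(ThinClient().dictate(b"...captured audio..."))  # -> call home
```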
In addition to transmitting corresponding text back to client devices, a server computer may transmit other forms of responses to client devices based on the speech recognition. For example, after a user utterance is converted to text by the ASR, the server computer may employ a natural language understanding (NLU) process to interpret and understand the user utterance. After the NLU process interprets the user utterance, the server computer may employ application logic to respond to the user utterance. Depending on the interpretation of the user utterance, the application logic may request information from an external data source or may invoke an external logic process. Each of these processes contributes to the total latency perceived by the user between the end of a user utterance and the beginning of a response.
As an example, a user utterance may request information to make a flight booking. The user device can receive a voice signal corresponding to the user utterance and transmit it to a server computer. First, the server computer can use ASR on the voice signal to determine the text of the user utterance. Next, the server computer can use an NLU process to understand that the user request seeks information to make a flight booking. Application logic on the server computer may then contact one or more external servers to obtain information on the requested flight booking. After the application logic determines that flight booking information responsive to the user request has been received, the server computer can transmit the responsive flight booking information to the user device. The total time between when the user finishes uttering the request and when the user device receives the responsive flight booking information from the server computer is the user-perceived latency.
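The server-side stages in the flight booking example can be sketched as a simple pipeline. All of the component functions below are hypothetical stand-ins (a real system would run trained ASR and NLU models and contact actual external flight servers); the sketch only shows how the stages chain together, each one adding to the user-perceived latency:

```python
def asr(voice_signal: bytes) -> str:
    """Stand-in ASR stage: voice signal -> text."""
    return "book a flight to boston"

def nlu(text: str) -> dict:
    """Stand-in NLU stage: text -> structured interpretation of the request."""
    return {"intent": "flight_booking", "destination": "boston"}

def application_logic(interpretation: dict) -> dict:
    """Stand-in application logic: a real implementation would contact one
    or more external servers for flight booking information here."""
    return {"destination": interpretation["destination"],
            "flights": ["BOS-101", "BOS-202"]}  # hypothetical results

def handle_utterance(voice_signal: bytes) -> dict:
    """Server-side pipeline; each stage contributes to the total latency
    between the end of the utterance and the response."""
    text = asr(voice_signal)
    interpretation = nlu(text)
    return application_logic(interpretation)

response = handle_utterance(b"...voice signal...")
print(response["flights"])  # -> ['BOS-101', 'BOS-202']
```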
It would be advantageous to be able to measure the user-perceived latency. The user-perceived latency would serve as a quality-control metric for the user experience. In addition, any deviation from expected values of the user-perceived latency could provide an early warning of a problem within the system. With such an early warning, steps can be taken to quickly identify and correct problems, thereby improving the user experience.
However, measuring the user-perceived latency poses a problem. To determine the user-perceived latency, the difference in time between the end of a user utterance and the beginning of a response must be computed. The user device can determine when it receives a response, but it cannot readily determine the time of the end of the user utterance, because ASR is performed by the server computer. Further compounding the problem, the internal clocks of the client device and the server computer are likely not synchronized.
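The clock-synchronization difficulty can be made concrete with a small numerical sketch. The numbers below are hypothetical: naively subtracting a server-side timestamp (end of utterance) from a client-side timestamp (response received) yields the true latency corrupted by the unknown offset between the two clocks, which can even make the measured latency negative:

```python
# Unknown in practice: how far the server clock runs ahead of the client clock.
CLOCK_OFFSET = 3.2          # seconds (hypothetical)

utterance_end_server = 100.0    # server-clock time at end of the user utterance
true_latency = 1.5              # actual processing + network time (seconds)

# The client-clock time when the response arrives.
response_received_client = (utterance_end_server - CLOCK_OFFSET) + true_latency

# Naive subtraction mixes timestamps from two unsynchronized clocks.
naive_latency = response_received_client - utterance_end_server
print(naive_latency)        # about -1.7, not the true 1.5: off by the clock offset
```

Because the offset is unknown, the error cannot simply be subtracted back out, which is why the user-perceived latency cannot be computed by naively combining client-side and server-side timestamps.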
Additional background information on ASR systems may be found in L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993; F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997; X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing, Prentice Hall, 2001; D. Jurafsky and J. Martin, Speech and Language Processing (2nd edition), Prentice Hall, 2008; and M. Gales and S. Young, The Application of Hidden Markov Models in Speech Recognition, Now Publishers Inc., 2008; which are incorporated herein by reference in their entirety.