Automatic speech recognition (“ASR”) systems are designed to process input audio signals containing human speech and convert the human speech into text. To improve the performance of an ASR system, a process known as “endpoint detection” (or “endpointing”) is often performed on the input audio signal to estimate where the human speech begins and/or ends. For example, effective endpoint detection may remove portions of the input signal before speech begins or after speech ends, so that the system does not process audio including only silence, background noise (e.g., radio or other conversations) and/or non-speech sounds (e.g., breathing and coughing), thereby improving recognition accuracy and reducing system response time.
Various conventional techniques have been proposed to detect speech endpoints (e.g., speech to non-speech transitions and/or non-speech to speech transitions) in an input audio signal. These techniques analyze the input audio signal to determine one or more energy-related features of the signal, because the energy of speech sounds is typically greater than the energy of non-speech sounds. However, such analysis tends to be computationally intensive.
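By way of illustration only, the following Python sketch shows one simple form that an energy-based approach might take: the signal is divided into fixed-length frames, the mean squared amplitude of each frame is computed, and the first and last frames whose energy exceeds a threshold are taken as the speech endpoints. The function names, the frame length of 160 samples, and the threshold of 0.01 are assumptions chosen for the example and are not taken from any particular conventional technique.

```python
import math


def frame_energies(samples, frame_len):
    """Return the mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies


def detect_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices spanning the first through the
    last frame whose energy exceeds the threshold, or None if no frame does.

    frame_len and threshold are illustrative values, not tuned ones.
    """
    energies = frame_energies(samples, frame_len)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


def demo_signal(rate=8000):
    """Synthetic input: 0.1 s of silence, 0.2 s of a 440 Hz tone standing
    in for speech, then 0.1 s of silence (at the default 8 kHz rate)."""
    silence = [0.0] * (rate // 10)
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate)
            for n in range(rate // 5)]
    return silence + tone + silence


if __name__ == "__main__":
    print(detect_endpoints(demo_signal()))
```

Even this minimal version visits every sample of the input, which gives some sense of why energy analysis may be burdensome on a device with limited processing capabilities. A single fixed threshold would also be expected to misclassify loud background noise as speech.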
FIG. 1 illustrates an example of a conventional ASR system 100 that implements endpoint detection. In this example, a microphone 105 converts speech sound waves into an electrical audio signal, which is then digitized (e.g., sampled and quantized) by an analog-to-digital converter (“ADC”) 110. An endpoint detection module 115 processes the digitized signal to estimate where human speech begins and/or ends. The resulting endpointed signal is provided to an ASR engine 120 for processing into recognized text.
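The digitization step performed by the ADC 110 may be pictured, purely as a sketch, as sampling a continuous waveform at a fixed rate and mapping each sample to an integer code. In the following Python example, the function names, the 8 kHz sampling rate, and the 16-bit depth are assumptions made for illustration; FIG. 1 does not specify them.

```python
import math


def sample_sine(freq_hz, rate_hz, count, amplitude=1.0):
    """Sample a sine wave of the given frequency at rate_hz (sampling)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / rate_hz)
            for n in range(count)]


def quantize(sample, bits=16):
    """Map a sample in [-1.0, 1.0] to a signed integer code (quantization).

    Values outside the range are clipped first. The 16-bit default is an
    assumed bit depth, not one stated in the text.
    """
    max_code = 2 ** (bits - 1) - 1
    clipped = max(-1.0, min(1.0, sample))
    return int(round(clipped * max_code))


def digitize(freq_hz=440, rate_hz=8000, count=80, bits=16):
    """Sample and then quantize a test tone, as an ADC stand-in."""
    return [quantize(s, bits) for s in sample_sine(freq_hz, rate_hz, count)]


if __name__ == "__main__":
    print(digitize()[:8])
```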
Because speech recognition tends to be computationally intensive, it is often difficult or infeasible to implement a full ASR system on a device with limited resources, such as a mobile device with limited processing capabilities and storage reserves (e.g., a mobile phone, personal digital assistant, etc.). Even on a device with sufficient resources, a full ASR system may be difficult to implement because the computing environment in which the ASR system is running (e.g., a Java Runtime Environment) may not make certain resources available to applications such as the ASR system. For example, the computing environment may not allow applications full access to the device's processing capabilities, or it may not expose to applications the raw speech data output by the ADC.
In some applications, a mobile device serves as a front end that captures an input audio signal and transmits the signal (or some processed representation thereof) to a backend server that performs speech recognition on the signal. An example of such an architecture is illustrated in FIG. 2.
As shown in FIG. 2, a mobile device 200 includes a microphone 105 that converts speech sound waves into an electrical audio signal and an ADC 110 that digitizes the audio signal. An encoder 205 converts the digitized signal into an encoded representation that is more compact (i.e., compressed) or otherwise more suitable for transmission. The encoded signal is transmitted to the backend server 250 over one or more communication network(s) 225 and is decoded by a decoder 255 in the server 250. The output of the decoder 255 is a decoded signal that approximates the digitized signal prior to encoding. An endpoint detection module 115 performs endpointing on the decoded signal and provides the resulting endpointed signal to an ASR engine 120. The ASR engine 120 performs speech recognition processing on the decoded signal, while being informed by the endpointed signal. The recognized text output by the ASR engine 120 is returned to the mobile device 200 via the communication network(s) 225, for example, to be displayed to a user of the mobile device 200.
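The relationship between the encoder 205 and the decoder 255 may be illustrated with a short Python sketch. The example uses mu-law companding followed by 8-bit quantization as a stand-in codec; this choice is an assumption made for illustration, since FIG. 2 does not name any particular encoding. The sketch shows only that the encoded form is compact (one byte per sample) and that the decoded signal approximates, rather than exactly reproduces, the signal prior to encoding.

```python
import math

MU = 255  # conventional mu-law parameter; the codec choice is an assumption


def encode(samples):
    """Compand samples in [-1.0, 1.0] with mu-law and pack each into one
    byte, as a stand-in for the encoder on the mobile device."""
    out = bytearray()
    for x in samples:
        x = max(-1.0, min(1.0, x))
        y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
        out.append(int(round((y + 1.0) / 2.0 * 255)))
    return bytes(out)


def decode(payload):
    """Invert encode(), as a stand-in for the decoder on the server.
    The result approximates the original samples."""
    samples = []
    for code in payload:
        y = code / 255 * 2.0 - 1.0
        x = math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
        samples.append(x)
    return samples


if __name__ == "__main__":
    original = [0.5 * math.sin(2 * math.pi * 440 * n / 8000)
                for n in range(160)]
    payload = encode(original)       # would be sent over the network
    restored = decode(payload)       # input to endpointing and recognition
    worst = max(abs(a - b) for a, b in zip(original, restored))
    print(len(payload), worst)
```

In an arrangement such as that of FIG. 2, the endpoint detection module and the ASR engine would operate on the output of `decode`, so any distortion introduced by the codec is present in the signal they analyze.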