Computing devices are increasingly using automatic speech recognition (ASR) systems in conjunction with text-to-speech (TTS) systems to function as a type of user interface. For example, a computing device may listen via a microphone for spoken commands, which the computing device recognizes in order to perform certain types of functions. Furthermore, the computing device may provide feedback or prompt the user in the form of simulated speech via a speaker. In doing so, a user may interact with the computing device without providing tactile user input or looking at a display, which may obviate the need for a display entirely and allow the user to interact with the computing device in a hands-free manner.
To recognize speech, ASR systems traditionally implement ASR algorithms that are trained in accordance with some level of expected noise and type of speech, which typically involves establishing a set of “tuning parameters” that are static and do not change. For example, an ASR system in a vehicle navigation device may implement ASR algorithms for a particular language or dialect, and thus may be trained for corresponding noise characteristics associated with its intended use, i.e., vehicle cabins, which are relatively quiet, stable, and predictable. As a result, if the environment deviates from that for which the ASR algorithms were originally trained, the ASR system may fail to properly recognize speech. For instance, the vehicle navigation device described above may work well while the vehicle cabin remains relatively quiet (i.e., similar to the conditions under which the algorithm was originally trained), but fail to recognize speech properly when the windows are open while driving and the vehicle cabin is noisier.
Moreover, ASR systems may experience false accepts (FAs) and false rejects (FRs) as part of their operation. FAs are associated with an incorrect identification of a particular word or phrase, whereas FRs are associated with the ASR system failing to recognize a particular word or phrase, which commonly occurs in noisier environments. Therefore, ASR algorithms are typically tuned either to minimize FAs (for quiet environments) or to minimize FRs (for noisy environments), which are different goals requiring different sets of tuning parameters. Because conventional ASR systems implement a single algorithm for a single type of noise environment and speech, a compromise must be struck between these two goals, preventing the ASR system from being truly optimized across different environments.
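The compromise between FAs and FRs may be illustrated with a minimal sketch. The sketch assumes, purely for illustration, that the only tuning parameter is a single static acceptance threshold applied to a recognizer confidence score. The function name `count_errors`, the threshold values, and the confidence scores are all hypothetical and are not taken from any particular ASR system.

```python
def count_errors(scored_utterances, threshold):
    """Return (false_accepts, false_rejects) for one fixed acceptance threshold.

    scored_utterances is a list of (confidence, is_target) pairs, where
    is_target is True when the utterance really contained the target phrase.
    """
    false_accepts = sum(
        1 for confidence, is_target in scored_utterances
        if confidence >= threshold and not is_target
    )
    false_rejects = sum(
        1 for confidence, is_target in scored_utterances
        if confidence < threshold and is_target
    )
    return false_accepts, false_rejects


# Hypothetical confidence scores: in a quiet cabin the target phrases score
# high; in a noisy cabin the same phrases score lower and overlap with
# non-target speech.
QUIET = [(0.95, True), (0.90, True), (0.85, True), (0.60, False), (0.30, False)]
NOISY = [(0.70, True), (0.55, True), (0.45, True), (0.50, False), (0.20, False)]

for threshold in (0.4, 0.8):
    print(threshold, count_errors(QUIET, threshold), count_errors(NOISY, threshold))
```

With these made-up scores, the strict threshold (0.8) produces no errors in the quiet data but rejects every target phrase in the noisy data, whereas the lenient threshold (0.4) avoids false rejects in both data sets at the cost of one false accept in each. No single static value is best for both environments.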
Furthermore, conventional ASR systems may implement two different speech recognizers to identify different portions of speech. For instance, a trigger speech recognizer may be implemented to listen for a “wake word.” Once the wake word is recognized, a command speech recognizer may be implemented to recognize subsequently spoken words. However, even when separate speech recognizers are implemented, both speech recognizers rely on the same set of ASR algorithms, which are tuned to the same, static noise environment and type of speech. Therefore, the FAs and FRs for the trigger speech recognizer and the command speech recognizer cannot be independently minimized.
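The coupling described above may be sketched as follows. This is a simplified illustration only: the class names `TuningParams` and `Recognizer`, the function `handle`, the example phrases, and the threshold value are hypothetical, and a real recognizer would operate on audio rather than on pre-scored text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TuningParams:
    """Static tuning parameters; frozen so they cannot change at run time."""
    accept_threshold: float


class Recognizer:
    """Toy recognizer that accepts a phrase if it is in its vocabulary and
    its confidence score meets the shared acceptance threshold."""

    def __init__(self, vocabulary, params):
        self.vocabulary = set(vocabulary)
        self.params = params

    def recognize(self, phrase, confidence):
        if phrase in self.vocabulary and confidence >= self.params.accept_threshold:
            return phrase
        return None


# Both recognizers are built from the same static parameter object, so the
# trigger and command stages cannot be tuned independently.
SHARED = TuningParams(accept_threshold=0.8)
trigger = Recognizer({"hello device"}, SHARED)
command = Recognizer({"navigate home", "play music"}, SHARED)


def handle(utterances):
    """Process (phrase, confidence) pairs: wait for the wake word, then try
    to recognize exactly one command before going back to sleep."""
    awake = False
    recognized = []
    for phrase, confidence in utterances:
        if not awake:
            awake = trigger.recognize(phrase, confidence) is not None
        else:
            result = command.recognize(phrase, confidence)
            if result is not None:
                recognized.append(result)
            awake = False
    return recognized
```

In this sketch, lowering `accept_threshold` to reduce false rejects of commands would also make the trigger stage more permissive, and raising it to reduce false accepts of the wake word would also make the command stage stricter.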
As a result, current ASR systems, and the way the ASR algorithms are implemented in accordance with such systems, have several drawbacks and limitations.