Computer-implemented recognition systems have been designed to perform a variety of recognition tasks. Such tasks include analysis of a video signal to identify humans captured in such signal, analysis of a video signal to identify a gesture performed by a human, analysis of a video signal to recognize an object therein, analysis of a handwriting sample to identify characters included in the handwriting sample, analysis of an audio signal to determine an identity of a speaker captured in the audio signal, analysis of an audio signal to recognize spoken words, amongst others.
With respect to automatic speech recognition (ASR) systems, such systems are becoming increasingly ubiquitous. For example, mobile telephones are currently equipped with ASR systems that are configured to recognize spoken commands set forth by users thereof, thus allowing the users to perform other tasks while setting forth commands to the mobile telephones. Gaming consoles have also been equipped with ASR systems that are likewise configured to recognize certain spoken commands, thereby allowing users of such gaming consoles to interact with the gaming consoles without requiring use of a handheld game controller. Still further, customer service centers accessible by telephone employ relatively robust ASR systems to assist users in connection with obtaining desired information. Accordingly, a user can access a customer service center by telephone, and set forth one or more voice commands to obtain desired information (or to be directed to an operator that can assist the user in obtaining the information).
With continuing reference to ASR systems, it can be relatively difficult to obtain suitable training data for training the ASR system prior to deploying the ASR system for real-world use. At least a portion of this difficulty arises from the increased use of devices that capture wideband audio signals, coupled with a dearth of wideband signals that can be used for training. In an example, a sampling rate utilized to capture speech data generally depends upon the device that captures the speech data. For instance, mobile telephones that have relatively recently become available are configured to capture wideband signals by sampling acoustic signals at a first sampling rate (e.g., 16 kHz), while older mobile telephones and landline telephones are configured to capture narrowband signals by sampling acoustic signals at a second, lower sampling rate (e.g., 8 kHz). While there is currently a relatively large amount of narrowband data available for training ASR systems, there is currently a smaller amount of wideband data available for training ASR systems. Adding to the difficulty in connection with obtaining training data, certain signals may be subject to environmental or channel distortion.
Several approaches have been set forth for training ASR systems while considering the heterogeneous nature of at least some training data. For instance, wideband signals in training data can be down-sampled, such that the wideband signals are converted to narrowband signals. The ASR system is then trained using only narrowband signals. This approach is clearly suboptimal, as the wideband data includes additional information that may be useful to identify phones therein. Another approach is to up-sample the narrowband signals (e.g., extend the bandwidth of the narrowband signals in the training data). Procedures for up-sampling narrowband data, however, can be complicated and often introduce errors.
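By way of illustration only, the down-sampling approach described above can be sketched as follows. The sketch assumes a 16 kHz wideband signal that is converted to an 8 kHz narrowband signal by low-pass filtering and then retaining every second sample. The function names (lowpass_fir, downsample_by_two, tone, rms), the filter length, and the cutoff are hypothetical choices made for this example and are not drawn from any particular ASR system.

```python
import math


def lowpass_fir(num_taps, cutoff):
    """Hamming-windowed sinc low-pass filter.

    cutoff is expressed as a fraction of the input sampling rate (0 to 0.5).
    num_taps is assumed to be odd so the filter has a single center tap.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at 0 Hz


def downsample_by_two(samples, num_taps=63):
    """Halve the sampling rate (e.g., 16 kHz to 8 kHz).

    Content above roughly a quarter of the input rate is filtered out first,
    so that it does not alias into the narrowband output.
    """
    # Assumed cutoff: slightly below the new Nyquist limit (0.25 of the input rate).
    taps = lowpass_fir(num_taps, cutoff=0.22)
    half = num_taps // 2
    out = []
    for i in range(0, len(samples), 2):  # keep every second filtered sample
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(samples):  # treat samples beyond the edges as zero
                acc += t * samples[j]
        out.append(acc)
    return out


def tone(freq_hz, rate_hz, n):
    """Generate n samples of a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]


def rms(xs):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))


if __name__ == "__main__":
    low = downsample_by_two(tone(1000, 16000, 800))   # within the narrowband range
    high = downsample_by_two(tone(6000, 16000, 800))  # above the narrowband range
    print("1 kHz level after down-sampling:", rms(low[40:-40]))
    print("6 kHz level after down-sampling:", rms(high[40:-40]))
```

In this sketch, a 1 kHz tone passes through essentially unchanged, while a 6 kHz tone is almost entirely removed. Because a signal sampled at 8 kHz can only represent content below 4 kHz, the higher-frequency content present in the wideband signal is discarded, which corresponds to the loss of additional information noted above.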
Yet another exemplary approach is to build and train two separate ASR systems: one trained using wideband signals and configured to perform recognition tasks over wideband signals, and one trained using narrowband signals and configured to perform recognition tasks over narrowband signals. As noted above, however, there currently exists a relatively large amount of narrowband training data, while there is a dearth of wideband training data; the ASR system trained using wideband signals would therefore have relatively little data from which to learn.