Current speech recognition systems rely on a variety of statistical models. Among those models are acoustic models and speech activity detection models. An acoustic model describes the acoustic properties of speech signals. A speech activity detection model is used to distinguish between speech signals and non-speech signals, such as background noise, and to feed only speech signals to the speech recognition engine.
Both of these types of models, and some others in speech systems, are generally statistical models that include many Gaussian mixtures. However, there are some problems associated with training these types of models.
In acoustic modeling, Gaussian probability distributions are built for thousands of different context dependent phones. In some current systems, these Gaussian mixtures are trained using maximum likelihood training. In maximum likelihood training, for each sub-phone (sometimes referred to as a senone), Gaussian mixtures are built to represent the distribution of the data corresponding to that senone by maximizing the likelihood of producing that data given the Gaussian Mixture Model of that senone. Distributions of different senones are estimated separately, and the interactions between different distributions are not explicitly considered in model training.
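As a minimal sketch of what separate maximum likelihood estimation looks like, the following Python fragment fits one diagonal-covariance Gaussian per senone. A single Gaussian stands in for a full mixture to keep the example short; the function names, the variance floor, and the toy data layout are assumptions made for illustration and do not describe any particular system.

```python
import math

VAR_FLOOR = 1e-6  # assumed floor to avoid zero variances; not from the text


def fit_gaussian_ml(frames):
    """Maximum likelihood mean and variance (per dimension) for one senone."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [
        max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, VAR_FLOOR)
        for d in range(dim)
    ]
    return mean, var


def log_likelihood(frame, mean, var):
    """Log density of one frame under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )


def train_senones(data_by_senone):
    """Fit each senone from its own data only; no other senone is consulted."""
    return {s: fit_gaussian_ml(frames) for s, frames in data_by_senone.items()}
```

Note that `train_senones` never looks at the data of a competing senone while fitting a given one, which is the property the paragraph above describes.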
This type of maximum likelihood training encounters a problem, which is basically one of competition. In other words, in generating a speech recognition result, senone models compete with one another. For instance, a speech recognizer might generate a plurality of possible word strings for a given speech input. Each of these valid word strings (e.g., those word strings validated by a language model) includes a sequence of phones, and therefore, a sequence of corresponding senones. The different phone sequences in the different possible word strings compete with one another, and the phone sequence with the highest score wins. The winning phone sequence is output by the speech recognition system as the recognition result. The absolute value of the likelihood is therefore unimportant; only the relative scores of the competing sequences determine the result, yet maximum likelihood training does not explicitly take those competitors into account.
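The point that only relative scores matter can be illustrated with a short sketch. The hypothesis strings and log scores below are invented purely for illustration, and `pick_winner` is a hypothetical helper rather than part of any recognizer.

```python
def pick_winner(scores):
    """Return the hypothesis with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical competing word strings with made-up log-likelihood scores.
scores = {
    "recognize speech": -120.5,
    "wreck a nice beach": -121.0,
}

# Shifting every score by the same constant changes the absolute values
# but not the ranking, so the same hypothesis still wins.
shifted = {hyp: s - 1000.0 for hyp, s in scores.items()}

winner = pick_winner(scores)
winner_shifted = pick_winner(shifted)
```

Here `winner` and `winner_shifted` are the same string, since the ranking depends only on the differences between scores.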
Moreover, acoustic models are very complicated models. They usually include tens of thousands of multi-dimensional Gaussian probability distributions, and describe the properties of thousands of different context-dependent phones. In current maximum likelihood training systems, Gaussian distributions of different phones are trained using the same training techniques and the same settings. However, the properties of different phones may be very different, and may require different settings for the training algorithm in order to achieve optimal results.
Some of the problems associated with speech activity detection models are similar to those for acoustic models, and other speech-related models. A basic speech activity detection model in a speech recognition system has a number of functions. One function is to find a meaningful speech segment within an acoustic signal, and feed that speech segment into the recognition engine. Another basic function is to trigger a barge-in scenario when a user begins to speak to an automated system, such as a telephony system or another device based on automated speech recognition.
In performing the first function, the speech activity detection system attempts to reject silence or noise, as much as possible, which is equivalent to reducing the false acceptance rate of silence/noise, and provide only speech to the speech recognizer. This helps to ensure that recognition is more accurate.
In performing the second function, the system attempts to respond to the user as soon as possible, so that the user experience is improved. To do so, the system attempts to reduce the false rejection rate, that is, the rate at which valid speech signals are erroneously rejected as being noise or silence.
Energy-based detection systems are currently used in some speech activity detectors, and these types of systems can work quite well in normal conditions. However, one of the challenges in many applications which implement speech activity detectors (such as telephony or other speech recognition-based systems) is to address the presence of environmental noise or channel noise. In terms of energy content, the difference between a speech signal having a very low amplitude, and environmental noise or channel noise, is sometimes not significant enough to make an appropriate decision in the speech activity detector.
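A minimal sketch of an energy-based detector, under assumed settings, might look as follows. The frame length of 160 samples, the threshold of -40 dB, and the function names are illustrative assumptions, not values taken from any described system.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of one frame, in decibels."""
    energy = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(energy + 1e-12)  # small offset avoids log(0)


def energy_vad(samples, frame_len=160, threshold_db=-40.0):
    """Mark each full frame as speech (True) if its energy exceeds the threshold."""
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        decisions.append(frame_energy_db(frame) > threshold_db)
    return decisions
```

With these assumed settings, a frame of constant amplitude 0.005 has an energy of roughly -46 dB and falls below the threshold just as background noise would, which illustrates the difficulty with low amplitude speech described above.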
Another approach to speech activity detection is referred to as a recognition-based approach. This approach builds up a set of statistical models, each representing different events relative to the speech activity detector, such as speech, silence, the transition phase from silence to speech, and the transition phase from speech to silence, environmental noise, etc. By considering more subtle information than energy itself, these models can be integrated with a uniform statistical pattern recognition process. The output of the recognition process is used as the basis of a decision for a speech activity detector.
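The recognition-based idea can be sketched, in a much simplified form, as choosing the event model that best explains each frame. The one-dimensional feature, the model parameters, and the decision to treat the onset event as speech are all assumptions made for this illustration.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a scalar under a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


# Hypothetical (mean, variance) pairs for a single scalar feature per event.
EVENT_MODELS = {
    "speech": (-10.0, 25.0),
    "onset": (-30.0, 36.0),
    "noise": (-45.0, 16.0),
    "silence": (-60.0, 16.0),
}

# Assumed mapping from events to the detector's final speech decision.
SPEECH_EVENTS = {"speech", "onset"}


def classify_frame(feature):
    """Return the most likely event and whether it counts as speech."""
    event = max(
        EVENT_MODELS,
        key=lambda e: gaussian_log_pdf(feature, *EVENT_MODELS[e]),
    )
    return event, event in SPEECH_EVENTS
```

A practical system would use multi-dimensional features and richer models; the sketch only shows how the output of the recognition step becomes the basis for the detector's decision.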
No matter which of these approaches is used, the goals of rejecting silence and responding to speech are not easy to meet simultaneously. Usually, one must make a tradeoff. In other words, a developer must either tune the decision threshold closer to silence, so that low amplitude speech signals will be passed to the speech recognition engine and a barge-in scenario will be launched with a low amplitude speech signal, or tune the decision threshold closer to speech, so that fewer non-speech waveforms are passed to the speech recognition system.
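The tradeoff can be made concrete with a small sketch that computes the false acceptance rate and the false rejection rate at a given threshold. The scores and labels used in the example are invented, and `error_rates` is a hypothetical helper.

```python
def error_rates(scores, labels, threshold):
    """Return (false acceptance rate, false rejection rate).

    labels[i] is True for speech; a segment is accepted as speech
    when its score is at or above the threshold.
    """
    speech = [s for s, is_speech in zip(scores, labels) if is_speech]
    non_speech = [s for s, is_speech in zip(scores, labels) if not is_speech]
    far = sum(s >= threshold for s in non_speech) / len(non_speech)
    frr = sum(s < threshold for s in speech) / len(speech)
    return far, frr


# Made-up detector scores; higher means more speech-like.
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [False, False, True, False, True, True]

far_low, frr_low = error_rates(scores, labels, threshold=0.35)
far_high, frr_high = error_rates(scores, labels, threshold=0.65)
```

On this toy data, the lower threshold accepts all speech but also one non-speech segment, while the higher threshold rejects all non-speech but also one speech segment.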
Speech detectors face other problems too. As mentioned above, the input waveform to a speech detection system can represent pure speech, or the transition phase from silence to speech (sometimes referred to as onset), or a short pause between speech phrases. The waveform can also represent silence, the echo of a prompt, coughing, environmental noise, etc., all of which correspond to non-speech segments. However, for a particular speech event (speech, non-speech, onset, etc.), the most often confused non-speech counterpart might be different. For example, the pure speech segment is often confused with an echo of a prompt or with background noise, because they all have a relatively high energy content. The transition phase from silence to speech, by contrast, is often confused with silence, because the two have overlapping regions (silence).
In current training, all of the model parameters are trained with the same training framework and the same controlling parameters. However, the most commonly confused speech events differ depending on the speech event under analysis. For instance, the difference between speech and noise can be learned because of their different nature, while silence and the transition phase are not as easily distinguished because their training samples overlap one another.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.