The present invention relates to speech recognition. In particular, the present invention relates to the testing and tuning of a speech recognizer.
First, the basic processes used in a speech recognition system will be described. In speech recognition systems, an input speech signal is converted into words that represent the verbal content of the speech signal. This conversion begins by converting the analog speech signal into a series of digital values. The digital values are then passed through a feature extraction unit, which computes a sequence of feature vectors based on the digital values. Each feature vector is typically multidimensional and represents a single frame of the speech signal.
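By way of illustration only, the framing and feature extraction steps described above can be sketched as follows. The sketch assumes a toy two-dimensional feature vector (log energy and zero-crossing rate); the function names, frame length, hop size, and synthetic input signal are hypothetical and are not taken from any particular recognizer, which would typically use richer features.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of digital sample values into overlapping frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def toy_features(frame):
    """Return a toy 2-dimensional feature vector: [log energy, zero-crossing rate]."""
    energy = sum(s * s for s in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small constant avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(frame) - 1)
    return [log_energy, zcr]

# Hypothetical input: a 100 Hz sine wave sampled at 8 kHz stands in for
# the digitized speech signal.
sample_rate = 8000
samples = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(800)]

# 25 ms frames (200 samples) advanced by 10 ms (80 samples); both values are assumptions.
frames = frame_signal(samples, frame_len=200, hop=80)
feature_vectors = [toy_features(f) for f in frames]
```

Each entry of `feature_vectors` corresponds to one frame of the signal, matching the one-vector-per-frame arrangement described above.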
To identify a most likely sequence of words, the feature vectors are applied to one or more models that have been trained using a training text. Typically, this involves applying the feature vectors to a frame-based acoustic model in which a single frame state is associated with a single feature vector. Recently, however, segment models have been introduced that associate multiple feature vectors with a single segment state. The segment models are thought to provide a more accurate model of large-scale transitions in human speech.
All models, both frame-based and segment-based, determine a probability for an acoustic unit. In initial speech recognition systems, the acoustic unit was an entire word. However, such systems required a large amount of modeling data since each word in the language had to be modeled separately. For example, if a language contains 10,000 words, the recognition system needed 10,000 models.
To reduce the number of models needed, the art began using smaller acoustic units. Examples of such smaller units include phonemes, which represent individual sounds in words, and senones, which represent individual states within phonemes. Other recognition systems used diphones, which represent an acoustic unit spanning from the center of one phoneme to the center of a neighboring phoneme. More recent recognition systems have used triphones, which represent an acoustic unit spanning three phonemes (such as from the center of one phoneme through the primary phoneme and to the center of the next phoneme).
When determining the probability of a sequence of feature vectors, speech recognition systems of the prior art did not mix different types of acoustic units. Thus, when determining a probability using a phoneme acoustic model, all of the acoustic units under consideration would be phonemes. The prior art did not use phonemes for some segments of the speech signal and senones for other parts of the speech signal. Because of this, developers had to decide between using larger units that worked well with segment models or using smaller units that were easier to train and required less data.
During speech recognition, the probability of an individual acoustic unit is often determined using a set of Gaussian distributions. At a minimum, a single Gaussian distribution is provided for each feature vector spanned by the acoustic units.
The Gaussian distributions are formed from training data and indicate the probability of a feature vector having a specific value for a specific acoustic unit. The training data is composed of thousands of repetitions of the different acoustic units, found in different places and contexts, spoken by different speakers, and recorded under different acoustic conditions. A final distribution can be described as an approximation of the histogram of all the vectors for all the occurrences of a particular modeling unit. For example, for every occurrence of the phoneme “th” in the training text, the resulting values of the feature vectors are measured and used to generate the Gaussian distribution.
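As a minimal sketch of how such a distribution might be formed and used, the following example estimates a Gaussian from a handful of feature vectors and then scores a new vector against it. It assumes a diagonal covariance (each dimension treated independently), which is a simplification chosen for the illustration; the function names and the training values are hypothetical.

```python
import math

def fit_diag_gaussian(vectors):
    """Estimate the per-dimension mean and variance from training feature vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    # A small floor keeps the variance strictly positive.
    var = [max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, 1e-6)
           for d in range(dim)]
    return mean, var

def log_density(x, mean, var):
    """Log of the diagonal-covariance Gaussian density evaluated at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

# Hypothetical feature vectors measured for occurrences of one acoustic unit.
training_vectors = [[1.0, 2.0], [3.0, 4.0]]
mean, var = fit_diag_gaussian(training_vectors)

near = log_density([2.0, 3.0], mean, var)   # a vector at the estimated mean
far = log_density([9.0, 9.0], mean, var)    # a vector far from the training data
```

A vector close to the training data for the unit receives a higher log density than a distant one, which is the sense in which the distribution indicates how probable a feature value is for that unit.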
Because different speakers produce different speech signals, a single Gaussian distribution for an acoustic unit can sometimes produce a high error rate in speech recognition simply because the observed feature vectors were produced by a different speaker than the speaker used to train the system. To overcome this, the prior art introduced a mixture of Gaussian distributions for each acoustic unit. Within each mixture, a separate Gaussian is generated for one group of speakers. For example, there could be one Gaussian for the male speakers and one Gaussian for the female speakers.
Using a mixture of Gaussians, each acoustic unit has multiple targets located at the mean of each Gaussian. Thus, by way of example, for a particular acoustic unit, one target may be from a male training voice and another target may be from a female training voice.
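The effect of having multiple targets can be sketched as follows. The example assumes a one-dimensional feature and a two-component mixture with equal weights; the means, variances, weights, and function names are hypothetical values chosen only to show that feature vectors near either target score well while those between the targets do not.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_mixture(x, components):
    """Log density of a Gaussian mixture; components is a list of (weight, mean, var)."""
    logs = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    top = max(logs)
    # Log-sum-exp keeps the computation numerically stable.
    return top + math.log(sum(math.exp(l - top) for l in logs))

# Hypothetical mixture for one acoustic unit: one target per speaker group.
components = [
    (0.5, [120.0], [100.0]),  # assumed target for one group of training voices
    (0.5, [210.0], [100.0]),  # assumed target for another group of training voices
]

at_first_target = log_mixture([120.0], components)
at_second_target = log_mixture([210.0], components)
between_targets = log_mixture([165.0], components)
```

In this sketch, observations from either speaker group fall near one of the two means and are scored favorably, whereas a single Gaussian would have to place its one mean between the groups.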
However, even as the development of speech recognizers has advanced, many problems have remained with the accuracy of the recognizers when presented with certain types of words. Although the accuracy of the vectors has increased, errors still occur because of the packaging and interpretation of the packaged vectors. These problems can include errors due to mismatches between the acoustic model and the utterances spoken, between the language model and the expected text, a combination of both, or other problems such as errors in the pronunciations or in the speech recognizer engine. Among the problems related to the language model, a particularly difficult one is that of homonyms.
Homonyms are words that sound alike but have different spellings and meanings. For example, common homonyms include read/reed, read/red, their/there, here/hear, cue/queue, whether/weather, and fore/for/four. Because these words are pronounced exactly the same, the recognizer must choose one of the words to match the spoken utterance. In most cases the recognizer selects the word that is indicated as a preferred word. This preference can be determined, for example, according to which word is the most commonly used, or which word appears to be linguistically appropriate based on language model information.
Language model related errors arise in instances where the speech recognition system cannot recognize individual words in any context, regardless of the data input. In this situation the expected word appears in the list of alternates, but it is not the first choice. These words can be recognized provided that the weight of the language model is reduced. Language model induced errors are instances where the speech recognition system can recognize individual words when the words are presented in isolation, but not in the context in which these words are presented in the test. For example, if the language model can recognize “to hose” in isolation, but not “want to hose” (for example, the system may recognize the input as “want to host”), this is a language model error. A second example of such an error is where the language model will properly recognize “July 25th”, but not “July 25th.”.
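The role of the language model weight can be illustrated with a short sketch. It assumes that each hypothesis carries an acoustic log score and a language model log score, and that the two are combined as a weighted sum, which is a common arrangement but is not stated in the text above. All scores, names, and weight values are hypothetical.

```python
def best_hypothesis(hypotheses, lm_weight):
    """Return the text with the highest combined score.

    hypotheses: list of (text, acoustic_log_score, lm_log_score).
    The combined score is acoustic + lm_weight * lm (an assumed combination rule).
    """
    return max(hypotheses, key=lambda h: h[1] + lm_weight * h[2])[0]

# Hypothetical scores: the acoustics slightly favor "hose", while the
# language model strongly favors "host" in this context.
hypotheses = [
    ("want to host", -10.5, -2.0),
    ("want to hose", -10.0, -6.0),
]

with_full_weight = best_hypothesis(hypotheses, lm_weight=1.0)     # "want to host"
with_reduced_weight = best_hypothesis(hypotheses, lm_weight=0.1)  # "want to hose"
```

With the full weight, the expected phrase remains an alternate rather than the first choice; reducing the weight lets the acoustic evidence dominate and promotes it to the top of the list.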
Other errors can be attributed to acoustic model mismatch, the speaker, and other sources. Most often these errors are due to a mismatch between the speaker's production of the utterances and the models, caused by a different pronunciation, accent, noise environment, etc., and are not caused by any internal error in the system. However, because of the nature of speech recognition systems, these types of errors can appear similar to the above errors. Therefore, it is necessary for the developer to identify the other error types without having to consider the possibility that the errors stemmed from, for instance, an acoustic mismatch. The present invention addresses at least some of these problems.