Speech recognition is a process by which an unknown speech utterance (usually in the form of a digital PCM signal) is identified. Generally, speech recognition is performed by comparing the features of an unknown utterance to the features of known words or word strings.
The features of known words or word strings are determined with a process known as "training". Through training, one or more samples of known words or strings (training speech) are examined and their features (or characteristics) recorded as reference patterns (or recognition unit models) in a database of a speech recognizer. Typically, each recognition unit model represents a single known word. However, recognition unit models may represent speech of other lengths such as subwords (e.g., phones, which are the acoustic manifestation of linguistically-based phonemes). Recognition unit models may be thought of as building blocks for words and strings of words, such as phrases or sentences.
To recognize an utterance in a process known as "testing", a speech recognizer extracts features from the utterance to characterize it. The features of the unknown utterance are referred to as a test pattern. The recognizer then compares combinations of one or more recognition unit models in the database to the test pattern of the unknown utterance. A scoring technique is used to provide a relative measure of how well each combination of recognition unit models matches the test pattern. The unknown utterance is recognized as the words associated with the combination of one or more recognition unit models which most closely matches the unknown utterance.
Recognizers trained using both first and second order statistics (i.e., spectral means and variances) of known speech samples are known as hidden Markov model (HMM) recognizers. Each recognition unit model in this type of recognizer is an N-state statistical model (an HMM) which reflects these statistics. Each state of an HMM corresponds in some sense to the statistics associated with the temporal events of samples of a known word or subword. An HMM is characterized by a state transition matrix, A (which provides a statistical description of how new states may be reached from old states), and an observation probability matrix, B (which provides a description of which spectral features are likely to be observed in a given state). Scoring a test pattern reflects the probability of the occurrence of the sequence of features of the test pattern given a particular model. Scoring across all models may be provided by efficient dynamic programming techniques, such as Viterbi scoring. The HMM or sequence thereof which indicates the highest probability of the sequence of features in the test pattern occurring identifies the test pattern.
A major hurdle in building successful speech recognition systems is non-uniformity in performance thereof across a variety of conditions. Many successful compensation and normalization techniques have been proposed in an attempt to deal with differing sources of non-uniformity in performance. Some examples of typical sources of non-uniformity in performance in telecommunications applications of speech recognition include inter-speaker, channel, environmental, and transducer variability, and various types of acoustic mismatch.
Model adaptation techniques have been used to improve the match during testing (i.e., during recognition of unknown speech) between a set of unknown utterances and the hidden Markov models (HMMs) in the recognizer database. Some model adaptation techniques involve applying a linear transformation to the HMMs. The parameters of such a linear transformation can be estimated using a maximum likelihood criterion, and then the transformation is applied to the parameters of the HMMs. A perplexing problem not heretofore solved is the existence of speakers in a population for whom speech recognition performance does not improve after model adaptation using such linear transformation techniques. This is especially true for unsupervised, single utterance-based adaptation scenarios.
It is generally thought that only those distributions in the HMMs that are likely to have been generated (during training) by the unknown utterance have a chance to be mapped more closely to the target speaker with such linear transformation model adaptation techniques. Therefore, if the "match" between the HMMs and the unknown utterance is not reasonably "good" to begin with and the number of unknown utterances is limited (such as for example in a single utterance-based adaptation scenario), then the utterance cannot "pull" the model to better match the target speaker in such conventional model adaptation techniques. Thus, there exists a subset of utterances for which model adaptation does not improve speech recognition performance.
Frequency warping for speaker normalization has been applied to telephone-based speech recognition applications. In previous testing practice, frequency warping for speaker normalization has been implemented by estimating a frequency warping function that is applied to the unknown input utterance so that the warped unknown utterance is better matched to the given HMMs. As is the case for model adaptation, there exists a subset of utterances for which frequency warping does not improve speech recognition performance.
The frequency warping approach to speaker normalization compensates mainly for inter-speaker vocal tract length variability by linear warping of the frequency axis (i.e., applying a frequency transformation to the frequency axis in the frequency domain) by a factor $\alpha$, where $\alpha = 1.00$ corresponds to no warping (no frequency transformation).
The front-end of a conventional speech recognizer processes samples of an unknown speech signal. The samples are obtained from recording windows of a specified duration (e.g., 10 ms) on the unknown speech signal, and such windows may overlap. The samples of the unknown speech signal are processed using a fast Fourier transform (FFT) component. The output of the FFT component is further processed and coupled to a mel-scale filterbank (which is also referred to as a mel-cepstrum filterbank). The mel-scale filterbank is a series of overlapping bandpass filters whose spacing and bandwidth increase with frequency along the frequency axis. The output of the mel-scale filterbank is a spectral envelope. An additional transformation applied to the spectral envelope provides a sequence of feature vectors, X, characterizing the unknown speech signal.
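As a minimal sketch of how such a filterbank may be laid out, the following places filter edge frequencies at equal steps on a mel scale, which yields spacing and bandwidth that grow with frequency in Hz. The particular mel formula (2595 log10(1 + f/700)), the triangular-filter convention of n + 2 edge points for n filters, and the function names are assumptions made for illustration; the description above does not specify them.

```python
import math


def hz_to_mel(f_hz):
    """One common mel-scale approximation (an assumption, not from the text)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_edges(n_filters, f_low, f_high):
    """Edge frequencies in Hz for n_filters overlapping filters.

    Filter k spans edges[k] .. edges[k + 2] and is centered at edges[k + 1],
    so adjacent filters overlap by half their width.
    """
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_filters + 1)
    return [mel_to_hz(m_low + i * step) for i in range(n_filters + 2)]
```

For example, `mel_filter_edges(10, 0.0, 4000.0)` returns twelve edge frequencies whose successive differences increase toward the upper band edge.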
In previous practice, frequency warping has been implemented in the mel-scale filterbank of the front-end of the speech recognizer by linear scaling of the spacing and bandwidth of the filters within the mel-scale filterbank. The warping factor is an index to the amount of linear scaling of the spacing and bandwidth of the filters within the mel-scale filterbank. Scaling the mel-scale filterbank in the front-end is equivalent to resampling the spectral envelope using a compressed or expanded frequency range. Changes in the spectral envelope are directly correlatable to variations in vocal tract length.
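The linear scaling described above may be sketched, under stated assumptions, as a multiplication of every filter edge frequency by the warping factor. The function name `warp_filter_edges`, the direction of the scaling convention, and the clipping of warped edges to an upper band limit `f_max` are all illustrative assumptions rather than details taken from the description.

```python
def warp_filter_edges(edges_hz, alpha, f_max):
    """Linearly scale filterbank edge frequencies by the warping factor alpha.

    Scaling every edge by alpha scales both the spacing and the bandwidth of
    the filters by the same factor. Edges that would exceed f_max are clipped
    to f_max (an assumption made here so the filters stay inside the band).
    """
    return [min(alpha * f, f_max) for f in edges_hz]
```

With `alpha = 1.0` the edges are returned unchanged, which corresponds to no warping.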
In frequency warping for speaker normalization according to previous practice, an ensemble of warping factors is made available, each being an index corresponding to a particular amount of linear scaling, and thus, to a particular spacing and bandwidth of the filters within the mel-scale filterbank. For each utterance, the optimal warping factor $\alpha$ is selected from a discrete ensemble of possible values so that the likelihood of the warped utterance is maximized with respect to a given HMM and a given transcription (i.e., a hypothesis of what the unknown speech is). The values of the warping factors in the ensemble typically vary over a range corresponding to frequency compression or expansion of approximately ten percent. The size of the ensemble is typically ten to fifteen discrete values.
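For illustration, one such ensemble might be generated as evenly spaced values centered on 1.00. The even spacing, the default of eleven values, and the function name `warping_ensemble` are assumptions; the description above states only the approximate range and the typical ensemble size.

```python
def warping_ensemble(max_warp=0.10, size=11):
    """Evenly spaced warping factors from 1 - max_warp to 1 + max_warp.

    With an odd size the ensemble contains 1.0, i.e. the unwarped case.
    """
    if size < 2:
        raise ValueError("size must be at least 2")
    step = 2.0 * max_warp / (size - 1)
    # Rounding only suppresses floating-point noise in the listed values.
    return [round(1.0 - max_warp + i * step, 6) for i in range(size)]
```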
Let $X^{\alpha} = g_{\alpha}(X)$ denote the sequence of cepstral observation vectors (i.e., the sequence of feature vectors), where each observation vector (i.e., each feature vector) is warped by the function $g_{\alpha}(\cdot)$, and the warping is assumed to be linear. If $\lambda$ denotes the set of HMMs and the parameters thereof, the optimal warping factor is defined as:

$$\hat{\alpha} = \arg\max_{\alpha} P(X^{\alpha} \mid \alpha, \lambda, H) \qquad (1)$$

where H is the transcription (i.e., a decoded string) obtained from an initial recognition pass using the unwarped sequence of feature vectors X. This frequency warping technique is computationally efficient since maximizing the likelihood in Eq. 1 involves only the forced probabilistic alignment of the warped observation vectors $X^{\alpha}$ to a single string H. Finally, the frequency-warped sequence of feature vectors $X^{\alpha}$ is used in a second recognition pass to obtain the final recognized string.
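The two-pass procedure built around Eq. 1 may be sketched as a search over the discrete ensemble. In the sketch below, `warp`, `decode`, and `aligned_log_likelihood` are hypothetical callables standing in for the warping function, the recognizer, and the forced-alignment scorer respectively; they are supplied by the caller and are not part of any particular library.

```python
def select_warping_factor(features, alphas, warp, log_likelihood):
    """Return the alpha in alphas that maximizes log_likelihood(warp(features, alpha)).

    Ties are resolved in favor of the earliest alpha in the ensemble.
    """
    best_alpha, best_score = None, float("-inf")
    for alpha in alphas:
        score = log_likelihood(warp(features, alpha))
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha


def two_pass_recognize(features, alphas, warp, decode, aligned_log_likelihood):
    """First pass on unwarped features, warp selection, then a second pass.

    decode(features) -> transcription H
    aligned_log_likelihood(features, H) -> log-likelihood of features force-aligned to H
    """
    hypothesis = decode(features)  # initial pass using the unwarped features
    alpha_hat = select_warping_factor(
        features,
        alphas,
        warp,
        lambda warped: aligned_log_likelihood(warped, hypothesis),
    )
    # Second pass on the warped features yields the final recognized string.
    return decode(warp(features, alpha_hat)), alpha_hat
```

Because only one forced alignment per candidate factor is needed, the cost of the search grows linearly with the size of the ensemble.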
Frequency warping for speaker normalization according to previous practice transforms an utterance according to a parametric transformation, $g_{\alpha}(\cdot)$, in order to maximize the likelihood criterion given in Eq. 1.
There is a large class of maximum likelihood-based model adaptation procedures that can be described as parametric transformations of the HMMs. For these procedures, let $\lambda_{\gamma} = h_{\gamma}(\lambda)$ denote the models obtained by a parametric linear transformation $h_{\gamma}(\cdot)$ of the original set of HMMs and parameters thereof. The form of the parametric linear transformation can depend on the nature of the sources of non-uniformity in speech recognition performance and the number of observations (i.e., the sequence of feature vectors) available for estimating the parameters of the transformation.
A maximum likelihood criterion similar to that used for estimating $\alpha$ is used for estimating $\gamma$:

$$\hat{\gamma} = \arg\max_{\gamma} P(X \mid \gamma, \lambda, H) \qquad (2)$$
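As a minimal worked example of such an estimate, consider the simplest special case in which the transformation merely adds a single scalar bias to every Gaussian mean, and in which each frame has already been aligned to one scalar Gaussian. Under those assumptions (which are far narrower than the general linear transformations discussed above), the maximum likelihood bias has a closed form: the variance-weighted average of the residuals between frames and means. All names below are hypothetical.

```python
import math


def estimate_bias(frames, means, variances):
    """Closed-form ML estimate of a shared bias b for the model means + b.

    frames[t] is aligned to a Gaussian with mean means[t] and variance
    variances[t]. Setting the derivative of the log-likelihood with respect
    to b to zero gives b = sum((x - m) / v) / sum(1 / v).
    """
    numerator = sum((x - m) / v for x, m, v in zip(frames, means, variances))
    denominator = sum(1.0 / v for v in variances)
    return numerator / denominator


def log_likelihood(frames, means, variances, bias=0.0):
    """Gaussian log-likelihood of the frames with every mean shifted by bias."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m - bias) ** 2 / v)
        for x, m, v in zip(frames, means, variances)
    )
```

Richer transformations, such as a full matrix applied to the means, generally require iterative estimation and more adaptation data than a single bias term.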
The article by McDonough et al. entitled "An Approach To Speaker Adaptation Based On Analytic Functions", Proc. Intl. Conf. on Acoustics, Speech and Signal Processing, Atlanta, Ga., May 1996, pages 721-724, suggests that in an HMM-based speech recognition system the effect of frequency warping for speaker normalization is equivalent to that obtained from either a linear transformation applied to the cepstral feature space or a linear transformation applied to the means of the HMMs. In view of that reference, frequency warping for speaker normalization and a linear transformation in the cepstral feature space are the same, and would therefore have equivalent effects. McDonough et al. thus teach that combining frequency warping for speaker normalization with a linear transformation in the cepstral feature space is redundant.