1. Field of the Invention
The present invention relates to a method for robust speech processing through modeling channel and noise variations with affine transforms.
2. Description of the Related Art
Conventional speaker recognition and speech recognition systems enable a computer to verify a speaker's identity or to recognize speech. In a speaker verification system, an identified speaker pattern is used to verify a speaker's claimed identity from an utterance. Conventional telephone switching systems often route calls between the same starting and ending locations over different channels. In addition, different telephony devices, such as electret handset telephones, cellular telephones and speakerphones, operate over different channels and under varying noise conditions. The spectrum of speech received over these different channels can have a different shape due to the effects of the channel or the noise. Recognition of speakers or speech across different channels and under varying noise conditions is therefore difficult, because non-speech spectral components introduce variation into the speech spectrum.
Speech has conventionally been modeled in a manner that mimics the human vocal tract. Linear predictive coding (LPC) has been used to describe short segments of speech with parameters that can be transformed into a spectrum of positions (frequencies) and shapes (bandwidths) of the peaks in the spectral envelope of the speech segments. LPC cepstral coefficients represent the inverse z-transform of the logarithm of the LPC spectrum of a signal. Cepstrum coefficients can be derived from the frequency spectrum or from the linear predictor (LP) coefficients, and can serve as dominant features for speaker recognition.
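The derivation of cepstral coefficients from LP coefficients mentioned above follows a well-known recursion, c(n) = a(n) + Σ (m/n)·c(m)·a(n-m) for m = 1 … n-1. The sketch below illustrates that recursion; it is an illustrative implementation, not code from the patent, and the function name and sign convention for the predictor polynomial are assumptions.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Convert LP coefficients a[0..p-1] (convention A(z) = 1 - sum a_k z^-k)
    into LPC cepstral coefficients via the standard recursion
    c_n = a_n + sum_{m=1}^{n-1} (m/n) c_m a_{n-m}.
    Hypothetical helper for illustration only."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0  # a_n is zero beyond model order p
        for m in range(1, n):
            if n - m <= p:
                acc += (m / n) * c[m - 1] * a[n - m - 1]
        c[n - 1] = acc
    return c

# For a first-order predictor a_1, the recursion gives c_n = a_1^n / n,
# matching the series expansion of -log(1 - a_1 z^-1).
print(lpc_to_cepstrum(np.array([0.5]), 3))  # [0.5, 0.125, 0.041666...]
```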
One conventional attempt to model noise and environmental changes uses an adaptive transformation for normalization of speech; see Nadas et al., "Adaptive Labeling: Normalization of Speech by Adaptive Transformation Based on Vector Quantization," IEEE, 1988. The transformation due to changes in the talker is represented as A_t^(1), the transformation due to environmental changes and noise is represented as A_t^(2), and a nonadaptive signal-processing transformation is represented as A_t^(3). The unadapted signal available to the recognizer is then

X(t) = A_t^(3) A_t^(2) A_t^(1) X_0(t).
The normalization transformation is then adapted by perturbing it in the direction that reduces the error.
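The composed-transformation model described above can be illustrated numerically: the recognizer observes the speech only after the talker, environment and signal-processing transformations have been applied in sequence, and normalization amounts to undoing that composition. The snippet below is a minimal sketch of this idea, not from the patent; the matrix dimensions and the use of an exact matrix inverse are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical illustration of X(t) = A3 A2 A1 X0(t):
# A1 models the talker, A2 the environment/noise, A3 the
# nonadaptive signal processing; x0 is the underlying speech vector.
rng = np.random.default_rng(0)
A1, A2, A3 = (rng.standard_normal((3, 3)) for _ in range(3))
x0 = rng.standard_normal(3)

x = A3 @ A2 @ A1 @ x0  # unadapted signal seen by the recognizer

# Normalization seeks a mapping that recovers x0 from x; here the
# composed transformation is known, so we can invert it exactly.
x0_hat = np.linalg.solve(A3 @ A2 @ A1, x)
assert np.allclose(x0_hat, x0)
```

In practice the transformations are unknown and time-varying, which is why the scheme above adapts the normalization iteratively rather than inverting a known matrix.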
Another conventional attempt models variations in channel and noise conditions by a transformation of the cepstrum coefficients. The transformation of a speech vector in the cepstrum domain has been defined as c'(n)=ac(n)+b, wherein "a" represents a rescaling of the cepstrum vectors and "b" is a correction for noise. Conventional methods determine the "best" transform parameters by mapping either the test data to the training model or the training model to the test data. After the mapping is performed, a metric is used to measure the distance between the transformed and target cepstrum coefficients. A drawback of this mapping scheme is that it can introduce uncertainty into the distance measure: because the method assumes that every model is available as a potential match for a test object, it matches the test data even when an imposter's model is far from the target. It is therefore desirable to provide a method for robustly processing speech that minimizes channel and noise variations.