1. Technical Field
The present invention relates generally to speech recognition and, in particular, to Minimum Variance Distortionless Response (MVDR) based feature extraction for speech recognition.
2. Description of Related Art
Estimating the time-varying spectrum is a key first step in most feature extraction methods for speech recognition. Cepstral coefficients derived from a modified short-time spectrum are the most popular feature set and have been empirically observed to be the most effective for speech recognition. The modification of the spectrum is often based on perceptual considerations. Mel-Filtered Cepstral Coefficients (MFCC) are one such popular feature set.
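By way of illustration only, the final step of deriving cepstral coefficients from a short-time power spectrum may be sketched as a logarithm followed by a discrete cosine transform. The function name, the number of coefficients, and the flooring constant below are assumptions made for this sketch, and the perceptual modification (for example, Mel filtering) is assumed to have been applied to the spectrum beforehand.

```python
import numpy as np

def cepstra_from_power_spectrum(power_spectrum, n_ceps=13):
    """Return the first n_ceps cepstral coefficients of a power spectrum.

    Hypothetical helper: takes the log of the (already modified) spectrum
    and applies an unnormalized DCT-II.
    """
    spectrum = np.asarray(power_spectrum, dtype=float)
    # Floor the spectrum so the logarithm stays finite (assumed constant).
    log_spec = np.log(np.maximum(spectrum, 1e-12))
    n = len(log_spec)
    # DCT-II basis: cos(pi * k * (j + 0.5) / n) for coefficient k and bin j.
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n) + 0.5) / n)
    return basis @ log_spec
```

In this sketch a flat spectrum yields energy only in the zeroth coefficient, which is why the higher-order coefficients are commonly read as describing the shape of the spectral envelope.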
Both parametric and non-parametric methods of spectrum estimation have been studied for speech modeling. Of the parametric methods, the Linear Predictive Coding (LPC) based all-pole spectrum is the most widely used. However, it has been noted in the speech modeling literature that for medium-pitch and high-pitch voiced speech, Linear Predictive (LP) based all-pole models do not provide good models of the spectral envelope. See, for example, El-Jaroudi et al., “Discrete All-Pole Modeling,” IEEE Trans. Signal Processing, Vol. 39(2), pp. 411–23, February 1991. Furthermore, LP based cepstra are known to be very sensitive to noise. In contrast, non-parametric spectrum estimation methods such as the Fast Fourier Transform (FFT) based Periodogram or Modified Periodogram are attractive since these methods are entirely data-independent and, thus, do not suffer from problems arising from modeling deficiencies. However, these methods are often not robust and therefore perform poorly in noisy and adverse conditions. In general, parametric methods with accurate models suited for the given application should be able to provide more accurate and robust estimates of the short-term power spectrum.
Minimum Variance Distortionless Response (MVDR) spectrum-based modeling of speech was recently proposed by Murthi et al., in “All-pole Modeling of Speech Based on the Minimum Variance Distortionless Response Spectrum,” IEEE Trans. on Speech and Audio Processing, pp. 221–39, May 2000. In the preceding article, it was shown that high order MVDR models provide elegant envelope representations of the short-term spectrum of voiced speech. Such high order models are particularly suited to speech recognition, where model order is not a concern. Furthermore, it was shown that the MVDR spectrum is capable of modeling unvoiced speech and mixed speech spectra. From a computational perspective, the MVDR modeling approach is also attractive because the MVDR spectrum can be simply obtained from a non-iterative computation involving the LP coefficients, and can be based upon conventional time-domain correlation estimates.
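By way of illustration only, the non-iterative computation referred to above may be sketched as follows. The sketch assumes a real-valued frame, biased autocorrelation estimates, and the commonly cited closed form in which the MVDR spectrum of order M is the reciprocal of a cosine series whose coefficients mu(k) are formed from the LP coefficients a_0 = 1, a_1, ..., a_M and the prediction error power Pe. All function and variable names are hypothetical and are not taken from the cited article.

```python
import numpy as np

def autocorrelation(frame, order):
    """Biased time-domain autocorrelation estimates r[0..order]."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.dot(frame[: n - k], frame[k:]) / n
                     for k in range(order + 1)])

def lp_coefficients(r):
    """Solve the normal equations for a = [1, a_1..a_M] and error power Pe."""
    order = len(r) - 1
    # Toeplitz autocorrelation matrix of size M x M, built by index lookup.
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a_tail = np.linalg.solve(r[idx], -r[1:])
    a = np.concatenate(([1.0], a_tail))
    pe = float(np.dot(a, r))  # Pe = r[0] + sum_k a_k r[k]
    return a, pe

def mvdr_spectrum_from_lp(a, pe, n_freqs=256):
    """MVDR power spectrum on [0, pi] computed from LP coefficients.

    mu(k) = (1/Pe) * sum_{i=0}^{M-k} (M + 1 - k - 2i) * a_i * a_{i+k}
    P(w)  = 1 / (mu(0) + 2 * sum_{k=1}^{M} mu(k) * cos(w k))
    """
    order = len(a) - 1
    mu = np.empty(order + 1)
    for k in range(order + 1):
        i = np.arange(order - k + 1)
        mu[k] = np.sum((order + 1 - k - 2 * i) * a[i] * a[i + k]) / pe
    omega = np.linspace(0.0, np.pi, n_freqs)
    k = np.arange(1, order + 1)
    denom = mu[0] + 2.0 * np.cos(np.outer(omega, k)) @ mu[1:]
    return omega, 1.0 / denom
```

A typical use would be to call `autocorrelation` on a windowed frame, pass the result to `lp_coefficients`, and then evaluate `mvdr_spectrum_from_lp`. The normal equations are solved here with a general linear solver for brevity; a Levinson-Durbin recursion could be substituted without changing the result.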
In speech recognition, in addition to faithful representation of the spectral envelope, statistical properties such as the bias and variance of the spectral estimate are also of great interest. Variance in the feature vectors has a direct bearing on the variance of the Gaussians modeling the speech classes. In general, reduction in feature vector variance increases class separability. Improved class separability can potentially increase recognition accuracy and reduce search time.
Accordingly, it would be desirable and highly advantageous to have robust methods and apparatus for feature extraction for speech recognition that reduce feature vector variance.