Many institutions, such as telephone companies, allow customers to access and control a wide variety of services and information by simply speaking into a telephone or microphone. The spoken sounds, such as the digits 0 to 9, are then recognized by a speech recognition system. FIG. 1 shows such a speech recognition system, including a feature analyzer 100 and a speech recognizer 300. The speech recognition system takes a speech input signal, such as the sound of the word "three," and produces an answer, such as a signal representing the number "3."
Different people, however, pronounce the same word, such as "three," in different ways. They may speak, for example, with different accents or have voices with different pitches. Such differences make it difficult to directly match the speech input signal with one or more sound samples to produce an answer. Therefore, it is known to first extract "features" from the speech input signals using the feature analyzer 100. The extracted features are typically selected so as to maintain invariance towards different speakers, styles, etc.
One widely used type of feature extraction is based on a mathematical system called "cepstral" analysis. In automatic speech recognition applications, N-dimensional signal vectors are represented by significantly shorter L-dimensional cepstral vectors. For each signal vector y, a cepstral vector c_y containing the L low order cepstral components {c_y(0), ..., c_y(L-1)} is used. Typical values for N and L are N=256 and L=12. The low dimensional cepstral vector is often referred to as a "feature vector" in pattern recognition.
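By way of illustration only, the reduction from an N-dimensional signal vector to an L-dimensional cepstral vector may be sketched as follows. The sketch assumes the common definition of the real cepstrum as the inverse discrete Fourier transform of the log power spectrum, uses an unsmoothed periodogram, and uses a direct DFT so that it needs only the Python standard library. The function name `real_cepstrum`, the spectral floor, and the synthetic test frame are assumptions introduced for this example and do not come from the system of FIG. 1.

```python
import cmath
import math
import random


def real_cepstrum(frame, num_coeffs=12):
    """Return the num_coeffs low-order real cepstral components of one frame."""
    n = len(frame)
    # Periodogram via a direct DFT: O(n^2), but free of external dependencies.
    power = []
    for k in range(n):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(acc) ** 2 / n)
    # Log power spectrum; the floor (an assumption) guards against log(0).
    log_power = [math.log(max(p, 1e-12)) for p in power]
    # Inverse DFT of the log spectrum. The log power spectrum of a real frame
    # is real and even, so the imaginary part is only rounding error.
    cepstrum = []
    for q in range(num_coeffs):
        acc = sum(log_power[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n))
        cepstrum.append(acc.real / n)
    return cepstrum


# Synthetic stand-in for one N=256 signal vector (not real speech).
rng = random.Random(0)
frame = [rng.gauss(0.0, 1.0) for _ in range(256)]
cepstrum = real_cepstrum(frame, num_coeffs=12)  # L=12 feature vector
```

In this sketch, scaling the frame by a constant gain shifts only the zeroth component, which is one way to see how the logarithm compresses the dynamic range of the signal.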
The extracted features can then be processed by the speech recognizer 300 to produce the answer. This is done by statistically modeling the cepstral vectors representing speech signal vectors for a given word in the vocabulary using a Hidden Markov Model (HMM). The HMM provides a parametric representation for the probability density function (pdf) of the cepstral vectors for a given word. It assumes that cepstral vectors can emerge from several Markovian states, where each state represents a Gaussian vector source with a given mean and covariance matrix. The parameters of the HMM, which consist of initial state probabilities, state transition probabilities, mixture gains, mean vectors and covariance matrices of different states and mixture components, are estimated from training data. Recognition of the speech signal is performed by finding the pre-trained HMM which scores the highest likelihood for the cepstral vectors of the input signal.
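The recognition step described above may be sketched, under simplifying assumptions, as a log-domain forward recursion followed by a maximum over word models. The sketch assumes one diagonal-covariance Gaussian per state (no mixture components), and the dictionary layout, the function names, and the two-dimensional toy models for "three" and "nine" are hypothetical values chosen only to make the example runnable.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def safe_log(p):
    return math.log(p) if p > 0.0 else -math.inf


def logsumexp(values):
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_likelihood(frames, hmm):
    """Forward algorithm: log p(frames | hmm), summed over all state paths."""
    init, trans = hmm["init"], hmm["trans"]
    means, variances = hmm["means"], hmm["vars"]
    states = range(len(init))
    alpha = [safe_log(init[s]) + log_gauss_diag(frames[0], means[s], variances[s])
             for s in states]
    for x in frames[1:]:
        alpha = [
            logsumexp([alpha[r] + safe_log(trans[r][s]) for r in states])
            + log_gauss_diag(x, means[s], variances[s])
            for s in states
        ]
    return logsumexp(alpha)


def recognize(frames, models):
    """Return the word whose pre-trained HMM scores the highest likelihood."""
    return max(models, key=lambda word: log_likelihood(frames, models[word]))


# Hypothetical two-state, left-to-right toy models over 2-dimensional features.
models = {
    "three": {"init": [1.0, 0.0], "trans": [[0.6, 0.4], [0.0, 1.0]],
              "means": [[0.0, 0.0], [2.0, 2.0]], "vars": [[1.0, 1.0], [1.0, 1.0]]},
    "nine": {"init": [1.0, 0.0], "trans": [[0.6, 0.4], [0.0, 1.0]],
             "means": [[5.0, 5.0], [-3.0, -3.0]], "vars": [[1.0, 1.0], [1.0, 1.0]]},
}
frames = [[0.1, -0.2], [0.3, 0.1], [1.8, 2.1], [2.2, 1.9]]
answer = recognize(frames, models)
```

A practical recognizer would estimate these parameters from training data and would typically use mixtures of Gaussians per state; both are omitted here for brevity.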
The state covariance matrices of the HMM are normally assumed diagonal. A justification for attributing a diagonal covariance matrix to cepstral vectors in a given HMM state is that, under some assumptions, the covariance matrix of a cepstral vector obtained from the smoothed periodogram of N samples of a Gaussian stationary signal is asymptotically proportional to an identity matrix as N and the spectral window length go to infinity.
In addition to providing significant reduction in dimensionality, and the asymptotic identity covariance matrix, the low order cepstral representation of acoustic speech signals captures the spectral envelope of the signal while suppressing the speaker dependent pitch information which is less relevant to speech recognition. The dynamic range of the signal is also reduced in a manner similar to that performed by the human auditory system, and equalization of stationary transmission channels, or microphone transducers used during different recording sessions, is possible using simple subtraction techniques. Because of these useful properties, cepstral representation of acoustic speech signals has become the standard approach in the industry.
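The subtraction-based equalization mentioned above may be illustrated as follows. The sketch assumes that a stationary channel contributes the same additive offset to every cepstral vector of an utterance, so that subtracting the per-utterance mean removes it. This particular choice, commonly called cepstral mean subtraction, is an assumption of the example, and the function name and the numeric values are hypothetical.

```python
def subtract_cepstral_mean(cepstra):
    """Subtract the per-utterance mean from every cepstral vector."""
    count = len(cepstra)
    dim = len(cepstra[0])
    mean = [sum(vec[i] for vec in cepstra) / count for i in range(dim)]
    return [[vec[i] - mean[i] for i in range(dim)] for vec in cepstra]


# Hypothetical cepstral vectors for one utterance (3 frames, 3 components).
clean = [[1.0, 0.2, -0.3], [0.8, -0.1, 0.4], [1.3, 0.5, 0.0]]
# A stationary channel adds the same cepstral offset to every frame.
channel = [0.5, -0.2, 0.1]
observed = [[c + h for c, h in zip(vec, channel)] for vec in clean]

# After mean subtraction the channel offset no longer matters.
equalized_clean = subtract_cepstral_mean(clean)
equalized_observed = subtract_cepstral_mean(observed)
```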
Known speech recognition systems using cepstral representation, however, have a number of drawbacks. For example, when performing Gaussian statistical modeling of cepstral vectors, as is commonly done in automatic speech recognition using HMMs, a system must use a large number of signal dependent parameters. The large number of parameters and the complex nature of the HMMs require a tremendous amount of computational power. Such a system can also be too slow for "real time" use. This modeling complexity is even more significant for complex speech recognition systems where thousands of HMM states are used. In addition, the large number of parameters that must be estimated requires a huge amount of training data for meaningful estimation of the HMMs.
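The scale of the parameter burden can be made concrete with a short count. The sketch below counts stored emission parameters (mixture gains, means, and covariance entries) and ignores initial and transition probabilities. The system size used in the example, 3000 states with 8 mixture components each, is a hypothetical figure chosen only for illustration; only L=12 is taken from the text above.

```python
def gaussian_param_count(dim, diagonal=True):
    """Mean entries plus covariance entries for one Gaussian."""
    covariance = dim if diagonal else dim * (dim + 1) // 2
    return dim + covariance


def emission_param_count(num_states, num_mixtures, dim, diagonal=True):
    """Stored emission parameters, counting one gain per mixture component."""
    per_component = gaussian_param_count(dim, diagonal) + 1
    return num_states * num_mixtures * per_component


per_gaussian_diagonal = gaussian_param_count(12, diagonal=True)   # 12 + 12
per_gaussian_full = gaussian_param_count(12, diagonal=False)      # 12 + 78
# Hypothetical system size, not taken from the source.
total_diagonal = emission_param_count(3000, 8, 12, diagonal=True)
total_full = emission_param_count(3000, 8, 12, diagonal=False)
```

Even with diagonal covariance matrices, a system of this hypothetical size stores several hundred thousand signal dependent values, each of which must be estimated from training data.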
Another problem with known systems is the non-linear nature of cepstral representation, which is caused by the introduction of a logarithmic function. This creates major difficulties when the recognizer 300 is trained on "clean" speech signals, and then tries to recognize "noisy" speech signals. Such a situation can be encountered, for example, when recognizing wireless communication signals or signals obtained through pay phones. In this case noise additivity is not maintained in the cepstral domain, and the effect of the noise on the cepstral representation of the clean signal is difficult to quantify. The mismatch between training and testing conditions is hard to correct, especially when the signal is corrupted by additive noise.
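The loss of additivity can be written out explicitly. The following derivation is a sketch that assumes the clean signal x and the noise n are uncorrelated, so that their power spectral densities add, and that the cepstrum is defined as the inverse Fourier transform of the log power spectrum. The symbols S_x, S_n, S_y and the frequency variable are notation introduced for this illustration.

```latex
% Assumption: x and n are uncorrelated, y = x + n.
% Definition used: c(k) = (1/2\pi) \int_{-\pi}^{\pi} \log S(\omega) e^{j\omega k} d\omega.
\begin{align*}
S_y(\omega) &= S_x(\omega) + S_n(\omega), \\
\log S_y(\omega) &= \log S_x(\omega)
  + \log\!\left(1 + \frac{S_n(\omega)}{S_x(\omega)}\right), \\
c_y(k) &= c_x(k) + \frac{1}{2\pi}\int_{-\pi}^{\pi}
  \log\!\left(1 + \frac{S_n(\omega)}{S_x(\omega)}\right) e^{j\omega k}\, d\omega .
\end{align*}
```

The correction term depends on the ratio of the noise spectrum to the clean spectrum at every frequency, rather than on the noise cepstrum alone, which is why the noisy cepstrum is not simply the sum of the clean and noise cepstra.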
Estimation of "clean" cepstral components in noisy environments has been attempted using a "bias removal" approach, a "stochastic matching" approach, and a "parallel model combination" approach. The "stochastic matching" approach attempts to estimate the parameters of an affine transformation, either from noisy to clean cepstral vectors or from the clean to noisy cepstral models. An explicit form for such a transformation has not been developed. Instead, data driven transformations are calculated by invoking the maximum likelihood estimation principle which is implemented using the Expectation Maximization (EM) procedure. This approach has also been implemented for "bias removal" from cepstral components. The aim of bias removal is to compensate the cepstral components for a bias introduced by an unknown communication channel or a transducer microphone that is different from that used in collecting the training data. In the "parallel model combination" approach, the parameters (state dependent means and variances) of separate HMMs for the clean signal and the noise process are combined using numerical integrations or empirical averages to form an HMM for the noisy signal.
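The "empirical averages" variant of parallel model combination may be sketched for a single state and a single spectral component. For simplicity the sketch works on log power values rather than cepstral components (the two are related by a linear transform), assumes that clean and noise powers add, and draws samples from the clean and noise Gaussians to estimate the mean and variance of the noisy Gaussian. The function name, the sample count, and the numeric parameters are hypothetical.

```python
import math
import random


def combine_log_power_gaussians(clean_mean, clean_var, noise_mean, noise_var,
                                num_samples=20000, seed=0):
    """Estimate the mean and variance of log(exp(x) + exp(n)) by sampling."""
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(num_samples):
        x = rng.gauss(clean_mean, math.sqrt(clean_var))  # clean log power
        n = rng.gauss(noise_mean, math.sqrt(noise_var))  # noise log power
        y = math.log(math.exp(x) + math.exp(n))          # noisy log power
        total += y
        total_sq += y * y
    mean = total / num_samples
    variance = total_sq / num_samples - mean * mean
    return mean, variance


# Hypothetical state parameters: noise about one log unit below the speech.
noisy_mean, noisy_var = combine_log_power_gaussians(2.0, 0.5, 1.0, 0.3)
# When the noise is far below the speech, the clean statistics are recovered.
quiet_mean, quiet_var = combine_log_power_gaussians(5.0, 0.5, -20.0, 0.3)
```

The combined distribution is not exactly Gaussian, so fitting a mean and variance to the samples is itself an approximation.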
With all of these approaches, however, it is very difficult to estimate the cepstrum of the clean signal from the cepstrum of the noisy process. This estimation is essential to improving the robustness of speech recognition systems in noisy environments.
In view of the foregoing, it can be appreciated that a substantial need exists for a method and apparatus that reduces the number of signal dependent parameters required when statistically modeling cepstral vectors, allows for a simple estimation of the cepstrum of a clean signal, and solves the other problems discussed above.