A. Field of the Invention
The present invention relates generally to speech analysis systems, and more particularly, to the estimation of speech spectral envelope parameters for speech signals in the presence of noise.
B. Description of Related Art
Automated speech analysis has important applications in modern society. Such applications can include speech recognition systems, word spotting systems, speaker recognition systems, vocoders, speech enhancement systems, language recognition systems, and other systems which analyze human speech signals.
A key operation performed in many speech analysis systems is the estimation of parameters describing the speech spectral envelope. The spectral envelope can be thought of as an amplitude curve in the frequency domain. The parameters describing the spectral envelope are typically estimated every 10-25 ms from (possibly overlapping) segments of a speech signal ranging from 15-30 ms in duration. Often, the parameters correspond to an all-pole (i.e., autoregressive) representation of the spectral envelope. Such a representation can be related to an acoustic tube model of the human vocal tract.
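By way of illustration only, one widely known way to obtain an all-pole representation from a single segment is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below is not taken from any particular system described here; the function names, the model order of 2, the Hamming window, and the synthetic test segment are all assumptions chosen for the example.

```python
import math
import random

def autocorrelation(frame, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of one windowed segment."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations (assumes r[0] > 0).

    Returns (a, err) with a[0] == 1, describing the all-pole model
    H(z) = sqrt(err) / (1 + a[1] z^-1 + ... + a[order] z^-order).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k                  # remaining prediction error
    return a, err

def envelope_db(a, err, num_points=8):
    """Model power spectrum in dB at num_points frequencies in [0, pi)."""
    out = []
    for i in range(num_points):
        w = math.pi * i / num_points
        re = sum(a[j] * math.cos(w * j) for j in range(len(a)))
        im = -sum(a[j] * math.sin(w * j) for j in range(len(a)))
        out.append(10.0 * math.log10(err / (re * re + im * im)))
    return out

if __name__ == "__main__":
    # Hypothetical segment: 240 samples of a synthetic first-order AR process.
    random.seed(0)
    x, prev = [], 0.0
    for _ in range(240):
        prev = 0.9 * prev + random.gauss(0.0, 1.0)
        x.append(prev)
    n = len(x)
    frame = [x[i] * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
             for i in range(n)]
    a, err = levinson_durbin(autocorrelation(frame, 2), 2)
    print(a, err, envelope_db(a, err))
```

In practice the model order is usually higher (on the order of 10 for narrowband speech), and the segment would come from the measured signal rather than a synthetic process.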
Speech enhancement systems, for example, generally apply a time-varying linear filter to the input speech signal for the purpose of producing an enhanced output speech signal. Robust estimation of speech and noise spectrum parameters can help with the design of the time-varying linear filter. Some speech enhancement systems are used as a preprocessor to a vocoder or recognition system to improve the performance of the vocoder or recognition system. When the input speech signal includes acoustic noise, the time-varying linear filter may try to approximate a Wiener filter so that the output speech signal is relatively free of acoustic noise. Other speech enhancement systems may seek to compensate for deleterious effects of mechanical, electrical, or other systems that may have distorted the speech signal, or they may seek to transform the input speech signal for some other purpose (e.g., to disguise the person's voice). In some systems, the estimated spectral envelope parameters are quantized to one of a finite number of possibilities. A vocoder is one such speech system that quantizes the spectral envelope parameters. In general, a vocoder analyzes a speech signal and transmits a quantized version of the spectral envelope parameters of the speech signal. The communication link over which the quantized version of the spectral envelope parameters is transmitted may be a low data rate communication link. A receiver synthesizes a speech signal for presentation to a human user based on the parameters.
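As a rough illustration of the Wiener filter mentioned above, the textbook frequency-domain gain for uncorrelated speech and noise is the ratio of the speech power spectrum to the sum of the speech and noise power spectra. The sketch below assumes those two spectra have already been estimated per frequency bin; the function name and the small `floor` guard against division by zero are assumptions made for the example.

```python
def wiener_gain(speech_psd, noise_psd, floor=1e-12):
    """Per-bin Wiener gain S / (S + N), with values in [0, 1].

    speech_psd and noise_psd are equal-length lists of non-negative
    power spectrum estimates (hypothetical inputs for this sketch).
    """
    return [s / max(s + n, floor) for s, n in zip(speech_psd, noise_psd)]

# Bins dominated by speech are passed nearly unchanged; bins dominated
# by noise are attenuated.
gains = wiener_gain([1.0, 0.0, 3.0], [1.0, 1.0, 1.0])
```

Because the gain depends directly on the speech spectrum estimate, errors in the estimated spectral envelope carry over into the filter.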
Speech analysis systems tend to suffer degraded performance in harsh acoustic noise environments. In such environments, a noise signal (which may be due, e.g., to various types of machinery or natural phenomena) is sensed along with the speech signal. The noise-corrupted speech signal is thus presented to the speech analysis system. If the noise is sufficiently strong, the estimated spectral envelope parameters may not closely match the true spectral envelope parameters of the speech signal absent the noise. In the case of a vocoder speech analysis system, this may mean that the synthesized human voice is no longer sufficiently intelligible to a human listener.
Speech recognition systems generally estimate spectral envelope parameters similar to those estimated in vocoder systems. In such speech recognition systems, the spectral envelope is typically represented by about 10-14 “cepstral” parameters. As with vocoder systems, when the signal presented to such systems is corrupted by sufficiently strong noise, these cepstral parameters differ enough from their noise-free values to increase the word recognition error rate of the system.
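Cepstral parameters can be computed in several ways. One standard option, shown here only as a sketch, is the recursion that converts all-pole coefficients into cepstral coefficients. The convention assumed is a denominator polynomial A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p with a[0] equal to 1; the function name is hypothetical, and the gain-dependent zeroth coefficient is omitted.

```python
def lpc_to_cepstrum(a, num_ceps):
    """Cepstral coefficients c[1..num_ceps] of the all-pole model 1/A(z).

    a is [1, a1, ..., ap] for A(z) = 1 + a1 z^-1 + ... + ap z^-p.
    """
    p = len(a) - 1
    c = [0.0] * (num_ceps + 1)
    for n in range(1, num_ceps + 1):
        # Only terms with 1 <= n - k <= p contribute to the sum.
        acc = sum(k * c[k] * a[n - k] for k in range(max(1, n - p), n))
        c[n] = -(a[n] if n <= p else 0.0) - acc / n
    return c[1:]

# For a single pole at z = 0.5, the cepstrum is 0.5**n / n.
ceps = lpc_to_cepstrum([1.0, -0.5], 3)
```

Many recognizers instead derive cepstra from a filter-bank analysis; the recursion above is just one of the available routes.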
A common theme among many conventional speech analysis systems, whether or not they are specifically designed to address the issues of noise corruption, is that they employ a two-step paradigm in which they estimate parameters and then quantize the parameters to obtain the final speech spectral envelope. Although the first step, estimation, may reduce the signal segment to a relatively small number of parameters, these parameters are effectively unquantized and, in principle, may represent any one of an infinite number of speech spectral envelopes. The second step, quantization, then reduces these possibilities to one of a finite number of speech spectral envelopes. Results of two-step estimate-and-then-quantize techniques can degrade significantly in the presence of noise.
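The quantization step of the two-step paradigm described above can be pictured as a nearest-neighbor search over a fixed codebook. The following is a minimal sketch under stated assumptions: the codebook entries are made up, and the squared Euclidean distance is chosen only for simplicity (practical systems often use other parameterizations and weighted distances).

```python
def quantize(params, codebook):
    """Return (index, entry) of the codebook entry closest to params.

    Distance is squared Euclidean, an assumption made for this sketch.
    """
    def dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    index = min(range(len(codebook)), key=lambda i: dist(params, codebook[i]))
    return index, codebook[index]

# Hypothetical three-entry codebook of two-parameter envelopes.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
index, entry = quantize([0.9, 0.8], codebook)
```

Note that the search sees only the already estimated parameters, not the measured signal, so an estimate perturbed by noise can land on a different codebook entry than the noise-free estimate would have.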
Thus, it would be desirable to more effectively obtain speech spectral envelopes, particularly as the signal-to-noise ratio (SNR) of the measured signal decreases.