The present invention relates to distinguishing between two non-stationary signals, and more particularly, to using a wavelet transform to detect voice (speech) activity.
Speech is produced by excitation of an acoustic tube, the vocal tract, which is terminated on one end by the lips and on the other end by the glottis. There are three basic classes of speech sounds. Voiced sounds are produced by exciting the vocal tract with quasi-periodic pulses of airflow caused by the opening and closing of the glottis. Fricative sounds are produced by forming a constriction somewhere in the vocal tract and forcing air through the constriction so that turbulence is created, thereby producing a noise-like excitation. Plosive sounds are produced by completely closing off the vocal tract, building up pressure behind the closure, and then abruptly releasing it.
It is well known in the art that, because the shape of the vocal tract is approximately constant over short intervals, voiced signals can be modeled as the response of a linear time-invariant system to a quasi-periodic pulse train. Unvoiced sounds can be modeled as wideband noise. The vocal tract is an acoustic transmission system characterized by natural frequencies (formants) that correspond to resonances in its frequency response. In normal speech, the vocal tract changes shape relatively slowly with time as the tongue and lips perform the gestures of speech, and thus the vocal tract can be modeled as a slowly time-varying filter that imposes its frequency response on the spectrum of the excitation.
FIG. 1a illustrates a waveform for the word "two." The waveform is an example of a non-stationary signal because the signal properties vary with time. Background noise is another example of a non-stationary signal. However, unlike background noise, the characteristics of a speech signal can be assumed to remain essentially constant over short (30 or 40 ms) time intervals.
FIG. 1b illustrates a spectrogram of the waveform shown in FIG. 1a. The frequency content of speech can range up to 15 kHz or higher, but speech is highly intelligible even when bandlimited to frequencies below about 3 kHz. Commercial telephone systems usually limit the highest transmitted frequency to the 3-4 kHz range.
A typical speech waveform consists of a sequence of quasi-periodic voiced segments interspersed with noise-like unvoiced segments. A GSM speech coder, for example, takes advantage of the fact that in a normal conversation, each person speaks on average for less than 40% of the time. By incorporating a voice activity detector (VAD) in the speech coder, GSM systems operate in a discontinuous transmission mode (DTX). Because the GSM transmitter is inactive during silent periods, discontinuous transmission mode provides a longer subscriber battery life and reduces instantaneous radio interference. A comfort noise subsystem (CNS) at the receiving end introduces a background acoustic noise to compensate for the annoying switched muting which occurs due to DTX.
Voice activity detectors are used quite extensively in the area of wireless communications. Voice activity detectors are not only used in GSM speech coders, but they are also used in other discontinuous transmission systems, noise suppression, echo canceling, and voice dialing systems. Because speech is usually accompanied by background noise, some segments of a speech signal have voiced sounds with background noise, some segments have noise-like unvoiced sounds with background noise, and some segments have only background noise. The voice activity detector's job is to distinguish voiced regions of the signal from unvoiced or background noise regions.
There are several known methods for voice activity detection. For example, U.S. Pat. No. 5,459,814 discloses a method in which an average signal level and zero crossings are calculated for the speech signal. Similarly, U.S. Pat. No. 5,596,680 discloses performing begin point detection using power/zero crossing. Once the begin point has been detected, the cepstrum of the input signal is used to determine the endpoint of the sound in the signal. After both the beginning and ending of the sound are detected, this system uses vector quantization distortion to classify the sound as speech or noise. While these methods are relatively easy to implement, they are not considered to be reliable.
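By way of illustration only, the level and zero-crossing measurements mentioned above can be sketched as follows. This is a generic sketch and not the method of either cited patent; the function names, the frame length, and both thresholds are hypothetical choices made for the example.

```python
import math

def frame_features(samples, frame_len):
    """Return (average absolute level, zero-crossing count) for each full frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        level = sum(abs(s) for s in frame) / frame_len
        # Count sign changes between consecutive samples.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        features.append((level, crossings))
    return features

def detect_activity(samples, frame_len, level_threshold, max_crossings):
    """Flag a frame as active when it is loud and has few zero crossings.

    Both thresholds are hypothetical tuning parameters, not values from any
    cited reference.
    """
    return [
        level > level_threshold and crossings < max_crossings
        for level, crossings in frame_features(samples, frame_len)
    ]

# Illustrative input: 8 kHz sampling, 20 ms frames (160 samples).
RATE = 8000
FRAME = 160
faint_noise = [0.001 * (-1) ** n for n in range(FRAME)]  # quiet, rapidly alternating
tone = [0.5 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(FRAME)]

flags = detect_activity(faint_noise + tone, FRAME,
                        level_threshold=0.05, max_crossings=40)
print(flags)  # [False, True]
```

With fixed thresholds such as these, a loud noise burst or a quiet talker is easily misclassified, which is consistent with the reliability concern noted above.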
Patent publication WO 95/08170 and U.S. Pat. No. 5,276,765 disclose a method in which a spectral difference between the speech signal and a noise estimate is calculated using linear prediction coding (LPC) parameters. These publications also disclose an auxiliary voice activity detector that controls updating of the noise estimate. While this method is relatively more reliable than those previously discussed, it is still difficult to reliably detect speech when the speech power is low compared to the background noise power.
Input signals are often analyzed by transforming the signal to a domain other than the time domain. Signals are usually transformed by utilizing appropriate basis functions or transformation kernels. The Fourier transform is often used to transform signals to the frequency domain. The Fourier transform uses orthonormal basis functions, namely sines and cosines of infinite duration. The transform coefficients in the frequency domain represent the contribution of each sine and cosine wave at each frequency.
Patent publication WO 97/22117 is an example of how the Fourier transform is used to detect voice activity. WO 97/22117 discloses dividing an input signal into subsignals representing specific frequency bands, estimating noise in each subsignal, using each noise estimate to calculate subdecision signals, and using each subdecision signal to make a voice activity decision.
The problem with using the Fourier transform is that it works under the assumption that the original time-domain signal is periodic in nature. As a result, the Fourier transform is poorly suited to non-stationary signals having discontinuities localized in time. When a non-stationary signal has abrupt changes, it is not possible to transform the signal using infinite basis functions without spreading the discontinuity over the entire frequency axis. The transform coefficients in the frequency domain cannot preserve the exact occurrence of the discontinuity and this information is lost.
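The spreading described above can be illustrated with a small numerical sketch. It assumes a discrete Fourier transform computed directly in plain Python; the signal length, the impulse positions, and all names are hypothetical and chosen only for the example.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitudes of the discrete Fourier transform, computed directly (O(N^2))."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]

N = 64
# A steady sinusoid that completes exactly 4 cycles within the analysis length.
sinusoid = [math.cos(2 * math.pi * 4 * t / N) for t in range(N)]
# A single abrupt event (a unit impulse) placed at two different times.
early = [1.0 if t == 10 else 0.0 for t in range(N)]
late = [1.0 if t == 50 else 0.0 for t in range(N)]

sin_mag = dft_magnitudes(sinusoid)
early_mag = dft_magnitudes(early)
late_mag = dft_magnitudes(late)

# The sinusoid is concentrated in two bins (k = 4 and its mirror image k = 60).
significant_bins = [k for k, m in enumerate(sin_mag) if m > 1e-6]
print(significant_bins)  # [4, 60]

# The impulse contributes equally to every bin, wherever it occurs in time.
print(all(abs(m - 1.0) < 1e-9 for m in early_mag))  # True
```

In this sketch the magnitude spectra of the early and late impulses are identical. The time of occurrence survives only in the phases, distributed across all of the coefficients, so no individual coefficient indicates when the event happened.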
Unfortunately, many real signals are non-stationary in nature and the analysis of these signals involves a compromise between how well transitions or discontinuities are located and how finely long-term behavior can be identified. One attempt to improve the performance of the Fourier transform involves replacing the complex sinusoids of the Fourier transform with basis functions composed of windowed complex sinusoids. This technique, which is often referred to as the short time Fourier transform (STFT), is best illustrated by the equation
$$T_F(\omega, \tau) = \int_{-\infty}^{\infty} x(t)\, h(t - \tau)\, e^{-j\omega t}\, dt$$
where $h(\cdot)$ is a window function and $T_F(\omega, \tau)$ is the Fourier transform of $x(t)$ windowed with $h(\cdot)$ shifted by $\tau$. Although the STFT overcomes some of the problems associated with using infinite basis functions, the STFT still suffers from the fact that the time-frequency resolution of the analysis is the same at all locations in the time-frequency plane. Generally speaking, voice activity detectors that use the Fourier transform or the short time Fourier transform are unreliable and require costly (power-consuming) computations. There is a need for a voice activity detector that can reliably and efficiently distinguish voiced regions of speech signals from unvoiced or background noise regions.
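For illustration, a discrete counterpart of the short time Fourier transform discussed above can be sketched as follows. The Hann window, the 32-sample window width, the hop size, and the test signal are assumptions made for the example and do not come from any cited reference. The phase of each frame is referenced to the start of its own segment, which does not affect the magnitudes used here.

```python
import cmath
import math

def stft(x, window, hop):
    """Discrete STFT: the DFT of x multiplied by the window shifted in steps of hop."""
    width = len(window)
    frames = []
    for tau in range(0, len(x) - width + 1, hop):
        segment = [x[tau + n] * window[n] for n in range(width)]
        frames.append([
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / width)
                for n in range(width))
            for k in range(width // 2 + 1)
        ])
    return frames

WIDTH = 32
# Periodic Hann window; the same window is applied at every time and frequency.
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / WIDTH) for n in range(WIDTH)]

# Test signal: 128 samples of silence, then 128 samples of a tone centered on bin 8.
signal = [0.0] * 128 + [math.cos(2 * math.pi * 8 * n / WIDTH) for n in range(128)]

frames = stft(signal, hann, hop=WIDTH)
magnitude_at_bin8 = [abs(frame[8]) for frame in frames]
active = [m > 1.0 for m in magnitude_at_bin8]
print(active)  # [False, False, False, False, True, True, True, True]
```

The onset of the tone is located only to within one 32-sample window, and that uncertainty is the same for every frequency bin because a single window width is used throughout. This is the fixed-resolution limitation noted above.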