The present invention relates generally to a method for estimating a speech signal in the presence of noise and, more particularly, to a soft decision signal estimation method for generating a soft estimate of a speech signal contained in a received signal.
One function of a digital communication system is to transmit a speech signal from a source to a destination. The speech signal is often corrupted by noise, which complicates and degrades the performance of coding, detection, and recognition algorithms. This problem is particularly severe in mobile communication systems, where numerous common sources of noise exist. For example, common noise sources in a mobile communication system include engine noise, background music, environmental noise (such as noise from an open window), and background speech from other persons. The efficiency of coding and recognition algorithms depends on being able to efficiently and accurately estimate both the speech and noise components of a received signal. There are many approaches presented in the literature to solve this problem. Among those, spectral subtraction is one of the most popular techniques because the speech signal is quasi-stationary, and the algorithm can be implemented efficiently using the Fast Fourier Transform (FFT).
The spectral subtraction method for signal estimation is based on the assumption that speech is present. When transmitted over the communication channel, the speech signal is corrupted by noise. The signal observed at the receiving end is the mixture of the speech signal and noise signal. The received signal is filtered in the frequency domain by a filter, such as a matched filter, that attempts to minimize the noise component in the received signal. The output of the matched filter is the estimate of the speech signal based on the assumption that speech was transmitted.
A filter commonly used in a signal detector is a Wiener filter, which minimizes the mean square error between the transmitted speech signal and the signal estimate. The Wiener filter uses the power spectral density (PSD) of the speech signal and noise signal to produce an estimate of the speech signal. Because the speech and noise signals are combined in the received signal, it is generally not possible to calculate the power spectral density of the speech signal and noise signal simultaneously. However, in a voice communication system, such as a mobile communication system, the speech signal is not present at all times. Thus, the power spectral density of the noise signal can be estimated during the time that the speech is absent. Assuming that changes in the noise signal are slow, the power spectral density of the speech signal can be calculated during the time that speech is present by subtracting the power spectral density of the noise signal (calculated when speech was not present) from the power spectral density of the received signal. This technique for calculating the power spectral density of the speech signal assumes that the speech signal and noise signal are independent, which is not always correct.
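The relationship between the two power spectral densities and the Wiener filter can be illustrated with a minimal Python sketch. It assumes the standard per-bin Wiener gain H = Ps / (Ps + Pn) and estimates Ps by subtracting the stored noise PSD from the received-signal PSD. The function name `wiener_gain`, the `floor` parameter, and the clamping of negative differences are illustrative assumptions and are not taken from the text above.

```python
def wiener_gain(received_psd, noise_psd, floor=0.0):
    """Return a per-bin Wiener gain H = Ps / (Ps + Pn).

    Ps is estimated by spectral subtraction: received PSD minus the
    noise PSD measured while speech was absent. Clamping the difference
    at `floor` is an assumption made here to avoid negative power.
    """
    gains = []
    for p_x, p_n in zip(received_psd, noise_psd):
        p_s = max(p_x - p_n, floor)      # speech PSD estimate
        denom = p_s + p_n
        gains.append(p_s / denom if denom > 0.0 else 0.0)
    return gains


# Three frequency bins: the first is speech-dominated, the others are not.
received_psd = [4.0, 1.0, 0.5]
noise_psd = [1.0, 1.0, 1.0]
gains = wiener_gain(received_psd, noise_psd)
```

In this sketch, a bin whose received power does not exceed the stored noise power receives a gain of zero, while a bin dominated by speech receives a gain close to one.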
In order to estimate the power spectral density of the noise signal and speech signal, a voice activity detector (VAD) is used to detect the presence of speech in the received signal. In a conventional VAD, the received signal input to the VAD is filtered, squared, and summed in order to measure the power of the signal during a given time period. The VAD produces an estimate θ̂ indicating whether speech is present. In a conventional detector, a hard decision is made, meaning that θ̂ takes on a value of 1 when speech is present and a value of 0 when speech is not present. The output of the Wiener filter is multiplied by θ̂. Consequently, a final estimate of the speech signal ŝ(k) is output only when θ̂ equals one. This method of signal estimation is known as hard decision estimation.
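Hard decision estimation as described above can be sketched in a few lines of Python. This is an illustration only: the VAD pre-filter is omitted, and the comparison against a fixed `threshold` is an assumption, since the text does not state how the measured power is turned into a decision. All function names are hypothetical.

```python
def frame_power(frame):
    """Square and sum the samples, normalized by frame length."""
    return sum(x * x for x in frame) / len(frame)


def hard_vad(frame, threshold):
    """Return theta_hat: 1 if speech is judged present, otherwise 0.

    Comparing the frame power against a fixed threshold is an assumed
    decision rule used only for this illustration.
    """
    return 1 if frame_power(frame) > threshold else 0


def hard_decision_estimate(filtered_frame, theta_hat):
    """Multiply the Wiener filter output by the hard decision."""
    return [theta_hat * x for x in filtered_frame]


loud = [1.0, -1.0, 1.0, -1.0]      # high-power frame
quiet = [0.1, -0.1, 0.1, -0.1]     # low-power frame
kept = hard_decision_estimate(loud, hard_vad(loud, threshold=0.1))
muted = hard_decision_estimate(quiet, hard_vad(quiet, threshold=0.1))
```

Here `kept` equals the input frame and `muted` is all zeros, so any frame the detector misclassifies as noise is discarded entirely.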
In hard decision signal estimation, errors made by the voice activity detector can result in significant error in the final estimate of the speech signal. For example, assume that a signal containing speech is received but is not detected by the voice activity detector. In this case, the speech signal will not be output from the signal detector.
Soft decision signal estimation was explored in R. J. McAulay and M. L. Malpass, SPEECH ENHANCEMENT USING A SOFT DECISION NOISE SUPPRESSION FILTER, IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-28:137-145, 1980. This article describes a signal estimation technique where the estimate θ̂ is not restricted to 1 or 0, but can be any number in the range 0 to 1. However, the soft decision signal estimation technique described in the article is based on the assumption that the speech signal is a deterministic signal with unknown magnitude and phase. In fact, speech is a random process, so the model used to estimate the speech signal is not appropriate. Therefore, the signal estimation technique described in the article is not optimal for detection of a speech signal.
The present invention is a soft decision signal estimation algorithm for generating an estimate of a speech signal from a received signal containing both speech and noise components. The received signal is converted to the frequency domain by a Fast Fourier Transform (FFT). In the frequency domain, the received signal is filtered by a Wiener filter to eliminate, as much as possible, the noise component of the signal. The output signal from the Wiener filter is converted back to the time domain by an inverse FFT. The output signal from the Wiener filter is then combined in the time domain with a speech probability estimate generated by a voice activity detector (VAD) to obtain a soft estimate of the speech signal.
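The processing chain summarized above (transform, filter, inverse transform, then weighting by the speech probability estimate) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: a direct DFT stands in for the FFT so that the example needs only the standard library, the Wiener gains are supplied as a precomputed list, and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct discrete Fourier transform (a stand-in for the FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n for t in range(n)]


def soft_estimate(frame, gains, speech_prob):
    """Filter a frame in the frequency domain, then weight it in time.

    `gains` holds one Wiener gain per frequency bin and is assumed to be
    symmetric about the Nyquist bin so the filtered frame stays real.
    `speech_prob` is the VAD's speech probability estimate in [0, 1].
    """
    spectrum = dft(frame)
    filtered = [h * s for h, s in zip(gains, spectrum)]
    time_signal = [v.real for v in idft(filtered)]
    return [speech_prob * v for v in time_signal]


frame = [1.0, 0.0, -1.0, 0.0]
estimate = soft_estimate(frame, gains=[1.0, 1.0, 1.0, 1.0], speech_prob=0.5)
```

With unity gains the filter passes the frame unchanged, so the result is simply the input scaled by the speech probability. Unlike the hard decision case, the frame is attenuated rather than discarded.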
A voice activity detector is used to compute the speech probability estimate. In conventional signal estimation, the VAD detects whether the received signal contains a speech component and outputs a hard decision (i.e. 0 or 1). In the present invention, the VAD generates a soft estimate of the probability of speech, called the speech probability estimate, that is combined with the output of the Wiener filter to obtain a soft estimate of the speech signal. To compute the speech probability estimate, the VAD computes a likelihood ratio based on the received signal. The likelihood ratio and the a priori probability of speech are used to compute the speech probability estimate. The likelihood ratio is also used to determine when to update the frequency response of the Wiener filter and VAD filter.
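One plausible way to combine a likelihood ratio with an a priori probability of speech is Bayes' rule, sketched below. The text above does not give the exact formula, so this form is an assumption: with likelihood ratio Λ = p(x | speech) / p(x | noise) and prior q, the posterior is qΛ / (qΛ + 1 − q). The function name is hypothetical.

```python
def speech_probability(likelihood_ratio, prior_speech):
    """Posterior probability of speech via Bayes' rule (assumed form).

    likelihood_ratio: p(x | speech) / p(x | noise), must be >= 0.
    prior_speech: a priori probability of speech, strictly in (0, 1).
    """
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood ratio must be non-negative")
    if not 0.0 < prior_speech < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    weighted = prior_speech * likelihood_ratio
    return weighted / (weighted + (1.0 - prior_speech))


neutral = speech_probability(1.0, 0.5)   # evidence favors neither hypothesis
likely = speech_probability(3.0, 0.5)    # evidence favors speech
```

A likelihood ratio of one leaves the prior unchanged, and larger ratios move the estimate smoothly toward one instead of switching abruptly between 0 and 1.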