1. Field of the Invention
The present invention relates to a speech pre-processing method, apparatus, and techniques for digital communication systems. More specifically, the present invention relates to using noise reduction parameters to adjust the gain and frequency response of speech signals in Personal Communication Service (PCS) systems.
2. Description of the Related Art
Multiple access in digital communication systems has numerous important practical applications. However, presently available multiple access techniques require that the messages corresponding to different users be separated in some manner such that they do not interfere with one another. Generally, this can be achieved by dividing the signal in the time or frequency domain. Then, different signals can be separated out by using some form of matched filtering or its equivalent, which responds to only a single signal because of the orthogonality of the signals.
There are several ways to achieve signal division of two or more signals. The messages can be separated in time, ensuring that different users transmit at different times; in frequency, ensuring that the different users use different frequency bands; or the messages can be transmitted at the same time and at the same frequency, but made orthogonal by some other means, such as code division, in which the users transmit signals which are guaranteed to be orthogonal through the use of specially designed codes.
Code Division Multiple Access (CDMA) has been the prevailing choice in systems for cellular communication. CDMA allows multiple access by using code sequences as traffic channels in a common transmission channel. By contrast, Time Division Multiple Access (TDMA) requires dividing a transmission channel into many time slots where each slot carries a traffic channel. Also, there is Frequency Division Multiple Access (FDMA), which allows multiple access by dividing an allocated spectrum into different transmission channels. For example, a spectral bandwidth of 1.2 MHz can be divided into 120 transmission channels with a channel bandwidth of 10 kHz. This is an FDMA scheme. A spectral bandwidth of 1.2 MHz can also be divided into 40 transmission channels with a radio channel bandwidth of 30 kHz, with each radio channel carrying three time slots. Therefore, a total of 120 time-slot channels are obtained. This is a TDMA scheme. Finally, a spectral bandwidth of 1.2 MHz can also be used as one transmission channel but provide 40 code-sequence traffic channels for each sector of a cell. A cell of three sectors has a total of 120 traffic channels. This is an example of a CDMA scheme. Therefore, in using CDMA communications, the frequency spectrum can be reused multiple times, permitting an increase in system user capacity. The use of CDMA results in a much higher spectral efficiency than can be achieved by using other multiple access techniques.
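The channel counts in the three examples above can be checked with a few lines of arithmetic. The following Python sketch only restates the figures given in the text; the variable names are illustrative.

```python
SPECTRUM_HZ = 1_200_000  # 1.2 MHz allocated spectrum

# FDMA: one traffic channel per 10 kHz transmission channel.
fdma_channels = SPECTRUM_HZ // 10_000

# TDMA: 30 kHz radio channels, each carrying three time slots.
tdma_radio_channels = SPECTRUM_HZ // 30_000
tdma_channels = tdma_radio_channels * 3

# CDMA: one 1.2 MHz transmission channel, 40 code-sequence traffic
# channels per sector, three sectors per cell.
cdma_channels = 40 * 3

print(fdma_channels, tdma_radio_channels, tdma_channels, cdma_channels)
# prints: 120 40 120 120
```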
Currently, there are three industry standards in CDMA technology that implement voice compression. The CDMA standard, Telecommunication Industry Association-Interim Standard 96 (TIA-IS96), uses "QCELP", "Pure Voice", and IS-127, otherwise known as Enhanced Variable Rate Coder (EVRC), as the three voice compression standards. Of the three standards, only IS-127 includes a noise reduction standard. This standard is widely used by digital transmission devices and techniques. A noise reducer (NR) performs noise processing in the frequency domain by adjusting the level of the frequency response of each frequency band, which results in a substantial reduction in background noise without affecting signal integrity.
FIG. 1 illustrates a block diagram of a conventional noise reducer operating at a 10 ms frame interval. This noise reducer primarily improves the signal-to-noise ratio (SNR) of the input signal before speech encoding begins, by operation of the following processes.
Original speech S(n) is passed through a high pass filter 100 which removes unnecessary low frequency noise. The high pass filter 100 initializes filter memory to all zeros, and thereafter filtering takes place in the form of a sixth order Butterworth filter implemented as three cascaded biquadratic sections with a cutoff frequency at 120 Hz.
At frequency domain conversion stage 101, a high pass filtered input signal S.sub.HP (n) is windowed using a smoothed trapezoid window, in which the first D samples of an input frame buffer d(m) (m=current frame) are overlapped from the last D samples of a previous frame d(m-1). In other words, for a sample index n with the input frame buffer d(m) having a frame length L of 80, the overlap in samples is given by the following expression.
$$d(m, n) = d(m-1, L+n), \quad 0 \le n < D \tag{1}$$
The remaining samples (i.e., the non-overlapping portions) of the input frame buffer d(m) are then pre-emphasized at the frequency domain conversion stage 101 to increase the high to low frequency ratio with a pre-emphasis factor .zeta..sub.P (here, set at -0.8) according to the following expression.
$$d(m, D+n) = S_{HP}(n) + \zeta_{P}\, S_{HP}(n-1), \quad 0 \le n < L \tag{2}$$
This results in the input frame buffer d(m) containing L+D=104 samples in which the first D samples are the pre-emphasized overlap from the previous frame (m-1), and the subsequent L samples are the input from the current frame m.
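The buffer construction of expressions (1) and (2) can be sketched in Python as follows. This is a minimal illustration rather than the standard's reference code: D=24 follows from L+D=104, and the function name and the `s_hp_prev_last` parameter (the last high pass filtered sample of the previous frame, needed for n=0) are assumptions introduced here.

```python
L = 80         # frame length in samples
D = 24         # overlap length, from L + D = 104
ZETA_P = -0.8  # pre-emphasis factor

def build_frame_buffer(prev_buffer, s_hp, s_hp_prev_last=0.0):
    """Return the (L + D)-sample input frame buffer d(m).

    prev_buffer    -- d(m-1), the previous frame buffer (L + D samples)
    s_hp           -- L high pass filtered samples of the current frame
    s_hp_prev_last -- S_HP(-1), last filtered sample of the previous
                      frame (hypothetical parameter, assumed 0 at start)
    """
    if len(prev_buffer) != L + D or len(s_hp) != L:
        raise ValueError("unexpected buffer length")
    d = [0.0] * (L + D)
    # Expression (1): the first D samples are the last D of d(m-1).
    for n in range(D):
        d[n] = prev_buffer[L + n]
    # Expression (2): pre-emphasize the L new samples.
    for n in range(L):
        previous = s_hp[n - 1] if n > 0 else s_hp_prev_last
        d[D + n] = s_hp[n] + ZETA_P * previous
    return d
```

Because the overlap is copied from the previous buffer, which already holds pre-emphasized samples, only the L new samples are pre-emphasized in each frame.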
Next, a smoothed trapezoidal window is applied to the input frame buffer d(m) to form a discrete Fourier transform (DFT) data buffer g(n). Thereafter, the DFT data buffer g(n) is transformed into the frequency domain using the DFT to obtain the frequency domain data buffer G(k).
A conventional transform technique such as a 64-point complex Fast Fourier Transform (FFT) is used to convert the time domain data buffer g(n) to the frequency domain data buffer spectrum G(k). For details on this technique, see Proakis et al., "Introduction to Digital Signal Processing," New York, Macmillan, pp. 721-722 (1988). The resulting spectrum G(k) is used to compute noise reduction parameters for the remaining blocks as explained below.
The frequency domain data buffer spectrum G(k) resulting from the frequency domain conversion 101 is used to estimate channel energy E.sub.ch (m) for the current frame m at channel energy estimator stage 102. Here, 64 point energy bands are computed from the FFT results of stage 101, and are quantized into 16 bands (or channels). The quantization is used to combine low, mid, and high frequency components and to simplify the internal computation of the algorithm. Also, in order to maintain accuracy, the quantization uses a small step size for low frequency ranges, increases the step size for higher frequencies, and uses the largest step size for the highest frequency ranges.
Thereafter, at the channel signal-to-noise ratio estimator stage 104, quantized 16 channel SNR indices .sigma..sub.q (i) are estimated using the channel energy E.sub.ch (m) from the channel energy estimator stage 102, and the current channel noise energy estimate E.sub.n (m) from a background noise estimator 109, which continuously tracks the input spectrum G(k) and whose operations will be explained shortly. In order to avoid undervaluing and overvaluing the SNR, the final SNR result is also quantized at the channel SNR estimator 104. Then, a sum of voice metrics v(m) is determined at stage 105 based upon the estimated quantized channel SNR indices .sigma..sub.q (i) from the channel SNR estimator stage 104. This involves summing, over all 16 channels, the values obtained from a predetermined voice metric table indexed by the quantized channel SNR indices .sigma..sub.q (i). The higher the SNR, the higher the voice metric sum v(m). Because the value of the voice metric v(m) is also quantized, the maximum and the minimum values are always ascertainable.
Then, at spectral deviation estimator stage 108, changes from speech to noise and vice versa are detected, which can be used to indicate the presence of speech activity or of a noise frame. In particular, a log power spectrum E.sub.db (m, i) is estimated based upon the estimated channel energy E.sub.ch (m) (from stage 102) for each of the 16 channels. Then, an estimated spectral deviation .DELTA..sub.E (m) between a current frame power spectrum E.sub.db (m) and an average long-term power spectral estimate {overscore (E)}.sub.db (m) is determined. The estimated spectral deviation .DELTA..sub.E (m) is simply a sum of the differences between the current frame power spectrum E.sub.db (m) and the average long-term power spectral estimate {overscore (E)}.sub.db (m) at each of the 16 channels. In addition, a total channel energy estimate E.sub.TOT (m) for the current frame is determined by taking the logarithm of the sum of the estimated channel energy E.sub.ch (m) at each frame. Thereafter, an exponential windowing factor .alpha.(m) as a function of the total channel energy E.sub.TOT (m) is determined, and the result of that determination is limited to a range determined by predetermined upper and lower limits .alpha..sub.H and .alpha..sub.L, respectively. Then, an average long-term power spectral estimate for the subsequent frame {overscore (E)}.sub.db (m+1, i) is updated using the exponential windowing factor .alpha.(m), the log power spectrum E.sub.db (m), and the average long-term power spectral estimate for the current frame {overscore (E)}.sub.db (m).
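The quantities computed at stage 108 can be sketched in Python as follows. The text does not give the exact formulas, so several details here are assumptions: the 10·log10 decibel scaling, the use of absolute differences (so that differences of opposite sign do not cancel), and the first-order recursive form of the long-term update. The mapping from E.sub.TOT (m) to .alpha.(m) is not specified in the text and is therefore left to the caller; only the limiting step is shown.

```python
import math

def log_power_spectrum(e_ch):
    """Per-channel log power (assumed 10*log10 scaling)."""
    return [10.0 * math.log10(e) for e in e_ch]

def spectral_deviation(e_db, e_db_avg):
    """Sum over channels of the difference between the current and
    long-term log spectra (absolute value is an assumption)."""
    return sum(abs(cur - avg) for cur, avg in zip(e_db, e_db_avg))

def total_channel_energy(e_ch):
    """Logarithm of the summed channel energies (assumed dB scaling)."""
    return 10.0 * math.log10(sum(e_ch))

def limit_alpha(alpha, alpha_low, alpha_high):
    """Limit the windowing factor to [alpha_low, alpha_high]."""
    return min(alpha_high, max(alpha_low, alpha))

def update_long_term(e_db_avg, e_db, alpha):
    """Long-term estimate for frame m+1 (assumed recursive average)."""
    return [alpha * avg + (1.0 - alpha) * cur
            for avg, cur in zip(e_db_avg, e_db)]
```

With this form, a value of .alpha.(m) near one makes the long-term estimate change slowly, while a smaller value lets it follow the current frame more closely.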
With the above variables determined at the spectral deviation estimator stage 108, the noise estimate is updated at noise update decision stage 107. Broadly speaking, at the noise update decision stage 107, a noise frame indicator (update.sub.-- flag) indicating the presence of a noise frame can be determined by utilizing the voice metrics v(m) from the voice metric calculation stage 105, and the total channel energy E.sub.TOT (m) and the spectral deviation .DELTA..sub.E (m) from the spectral deviation estimator stage 108. Using these three pre-computed values coupled with a simple delay decision mechanism, the noise frame indicator (update.sub.-- flag) is ascertained.
The delay decision is implemented using counters and a hysteresis process to avoid any sudden changes in the noise to non-noise frame detection.
FIG. 1A illustrates the detailed steps for updating the noise estimate. Initially at step 130, the noise frame indicator is initialized such that it does not indicate a noise frame (i.e., update.sub.-- flag=False). Then, if the voice metric sum v(m) is determined to be less than or equal to a predetermined update threshold level (UPDATE.sub.-- THLD) at step 131, the noise frame indicator is set to indicate a noise frame (update.sub.-- flag=True), and a background noise update counter is reset (update.sub.-- cnt=0) at step 132. Here, the predetermined update threshold level (UPDATE.sub.-- THLD) is set at a value of 35.
If the voice metric v(m) is above the predetermined update threshold level (UPDATE.sub.-- THLD), the update logic is forced at step 133. In other words, at step 133, it is determined whether the total channel energy E.sub.TOT (m) is greater than a predetermined noise floor level (NOISE.sub.-- FLOOR.sub.-- DB), and further, whether the spectral deviation .DELTA..sub.E (m) is below a predetermined deviation threshold level (DEV.sub.-- THLD). Here, the predetermined deviation threshold level (DEV.sub.-- THLD) is set at a value of 28.
If the total channel energy E.sub.TOT (m) is greater than the predetermined noise floor level (NOISE.sub.-- FLOOR.sub.-- DB), and further, if the spectral deviation .DELTA..sub.E (m) is below the predetermined deviation threshold level (DEV.sub.-- THLD), the background noise update counter is incremented by one (update.sub.-- cnt+1) at step 134. Then, at step 135, the background noise update counter (update.sub.-- cnt) is compared with a background noise update counter threshold level (UPDATE.sub.-- CNT.sub.-- THLD), which is set at 50. If it is determined that the update counter is greater than or equal to the background noise update counter threshold level, the noise frame indicator indicates a noise frame (update.sub.-- flag=True) at step 136.
Furthermore, to prevent long term creeping of the background noise update counter (update.sub.-- cnt), the hysteresis process is implemented as follows. If and only if the background noise update counter (update.sub.-- cnt) is equal to a previous update counter (last.sub.-- update.sub.-- cnt), a hysteresis counter (hyster.sub.-- cnt) is increased by one (hyster.sub.-- cnt+1). Otherwise, the hysteresis counter (hyster.sub.-- cnt) is reset to zero.
Then, the previous update counter (last.sub.-- update.sub.-- cnt) is set to the current background noise update counter (update.sub.-- cnt), and the hysteresis counter (hyster.sub.-- cnt) is compared with a predetermined hysteresis counter threshold level (HYSTER.sub.-- CNT.sub.-- THLD), which is set at 6. If the hysteresis counter (hyster.sub.-- cnt) is larger, then the background noise update counter (update.sub.-- cnt) is set to zero. In other words, the background noise update counter (update.sub.-- cnt) is allowed to keep its accumulated value only while the hysteresis counter (hyster.sub.-- cnt) remains at or below the threshold level (HYSTER.sub.-- CNT.sub.-- THLD).
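The decision logic of FIG. 1A, as described above, can be sketched in Python as follows. The threshold values are those given in the text; the value of NOISE.sub.-- FLOOR.sub.-- DB is not stated, so it is passed in as a parameter. The `NoiseUpdateState` container and the function name are illustrative assumptions.

```python
from dataclasses import dataclass

UPDATE_THLD = 35
DEV_THLD = 28
UPDATE_CNT_THLD = 50
HYSTER_CNT_THLD = 6

@dataclass
class NoiseUpdateState:
    """Counters carried from one frame to the next."""
    update_cnt: int = 0
    last_update_cnt: int = 0
    hyster_cnt: int = 0

def noise_update_decision(state, v_m, e_tot, delta_e, noise_floor_db):
    """Return update_flag for the current frame and advance the counters."""
    update_flag = False                              # step 130
    if v_m <= UPDATE_THLD:                           # step 131
        update_flag = True                           # step 132
        state.update_cnt = 0
    elif e_tot > noise_floor_db and delta_e < DEV_THLD:  # step 133
        state.update_cnt += 1                        # step 134
        if state.update_cnt >= UPDATE_CNT_THLD:      # step 135
            update_flag = True                       # step 136

    # Hysteresis: count frames in which update_cnt did not change.
    if state.update_cnt == state.last_update_cnt:
        state.hyster_cnt += 1
    else:
        state.hyster_cnt = 0
    state.last_update_cnt = state.update_cnt
    if state.hyster_cnt > HYSTER_CNT_THLD:
        state.update_cnt = 0
    return update_flag
```

In this sketch, a run of stationary, above-floor frames with a high voice metric forces an update on the fiftieth consecutive frame, while a stall of more than six frames clears the accumulated count.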
Referring back to FIG. 1, having updated the background noise at stage 107, it is determined whether channel signal-to-noise ratio modification is necessary, and, if so, the appropriate channel SNR indices .sigma..sub.q (i) are modified before being used at channel gain calculation stage 110. In some instances, it is necessary to modify the SNR value to avoid classifying a noise frame as speech. This error may stem from a distorted frequency spectrum. By analyzing the mid and high frequency bands at a channel SNR modifier stage 106, the pre-computed SNR can be modified if it is determined that a high probability of error exists in the processed signal. The above-described process is illustrated in FIG. 1B and explained below.
In order to initially set or reset a channel SNR modification flag (modify.sub.-- flag) which indicates whether modification is necessary, an index counter (index.sub.-- cnt) is initialized (index.sub.-- cnt=0) at step 150. Then a simple iteration is implemented from steps 151 to 156, and another from steps 157 through 165.
More particularly, for a channel frequency index i=N.sub.M to N.sub.c -1, (where N.sub.c =number of channels, which is set at 16 in this case, and N.sub.M =5), the following steps are taken. At step 152, it is determined whether the quantized channel SNR indices .sigma..sub.q (i) determined at the channel SNR estimator 104 (FIG. 1) are greater than or equal to a predetermined channel SNR index threshold level (INDEX.sub.-- THLD), which is set at 12. If so, the index counter (index.sub.-- cnt) is incremented by one (index.sub.-- cnt+1) at step 153. Thereafter, at step 154, it is determined whether the index counter (index.sub.-- cnt) is less than a predetermined index counter threshold level (INDEX.sub.-- CNT.sub.-- THLD) set at 5. If the index counter (index.sub.-- cnt) is less than the predetermined threshold level (INDEX.sub.-- CNT.sub.-- THLD), a channel SNR modification flag (modify.sub.-- flag) indicates that modification of the channel SNR is necessary (modify.sub.-- flag=True) at step 155. Otherwise, at step 156, the modification flag (modify.sub.-- flag) indicates that the modification is not necessary (modify.sub.-- flag=False), and the modified channel SNR indices .sigma.'.sub.q (i) are not changed from the original values (.sigma.'.sub.q (i)=.sigma..sub.q (i)) at step 163.
If channel SNR modification is necessary (i.e., modify.sub.-- flag=True) as determined at steps 150 to 156, the channel SNR indices .sigma..sub.q (i) are modified to obtain modified channel SNR indices .sigma.'.sub.q (i) at step 163. In other words, if and only if the modification flag (modify.sub.-- flag) indicates that modification is necessary (modify.sub.-- flag=True), an iterative process (steps 157-162 and 165) takes place for each of the 16 channels (i.e., for i=0 to N.sub.c -1).
If the voice metric sum v(m) determined at the voice metric calculation stage 105 (FIG. 1) is determined to be less than or equal to a predetermined metric threshold level (METRIC.sub.-- THLD), or if the channel SNR indices .sigma..sub.q (i) are less than or equal to a predetermined setback threshold level (SETBACK.sub.-- THLD) at step 158, the modified channel SNR indices .sigma.'.sub.q (i) are set to one at step 159. Here, the predetermined metric threshold level (METRIC.sub.-- THLD) is set at 45, while the predetermined setback threshold level (SETBACK.sub.-- THLD) is set at 12. Otherwise, the modified channel SNR indices .sigma.'.sub.q (i) are not changed from the original values (.sigma.'.sub.q (i)=.sigma..sub.q (i)) at step 165.
Thereafter, to keep the modified channel SNR indices .sigma.'.sub.q (i) from falling below a predetermined channel SNR threshold level .sigma..sub.th (set at 6 here), another iteration is implemented (for i=1 to N.sub.c -1) where it is first determined at step 160 whether the modified channel SNR indices .sigma.'.sub.q (i) are less than the predetermined channel SNR threshold level .sigma..sub.th. If so, the threshold limited, modified channel SNR indices .sigma.".sub.q (i) are set to the predetermined channel SNR threshold level .sigma..sub.th (.sigma.".sub.q (i)=.sigma..sub.th) at step 162. Otherwise, the threshold limited, modified channel SNR indices .sigma.".sub.q (i) are not changed from the modified channel SNR indices .sigma.'.sub.q (i) (i.e., .sigma.".sub.q (i)=.sigma.'.sub.q (i)) at step 161.
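The channel SNR modification of FIG. 1B, as described above, can be sketched in Python as follows. The constants are those given in the text. One simplification is an assumption of this sketch: the threshold limiting is applied to every channel, whereas the text states the range of that iteration as i=1 to N.sub.c -1. The function name is illustrative.

```python
N_C = 16             # number of channels
N_M = 5              # first channel examined when counting
INDEX_THLD = 12
INDEX_CNT_THLD = 5
METRIC_THLD = 45
SETBACK_THLD = 12
SIGMA_TH = 6

def modify_channel_snr(sigma_q, v_m):
    """Return (modify_flag, threshold limited modified SNR indices)."""
    # Steps 150-156: count mid and high channels with a high SNR index.
    index_cnt = sum(1 for i in range(N_M, N_C) if sigma_q[i] >= INDEX_THLD)
    modify_flag = index_cnt < INDEX_CNT_THLD

    # Steps 157-159, 165: set back weak channels only when flagged.
    if modify_flag:
        sigma_mod = [1 if (v_m <= METRIC_THLD or s <= SETBACK_THLD) else s
                     for s in sigma_q]
    else:
        sigma_mod = list(sigma_q)

    # Steps 160-162: limit every index to at least SIGMA_TH
    # (applied to all channels here, which is an assumption).
    sigma_lim = [max(s, SIGMA_TH) for s in sigma_mod]
    return modify_flag, sigma_lim
```

For example, with a voice metric sum of 60 and indices of 20 in channels 0 to 4 and 3 elsewhere, no mid or high channel reaches the index threshold, so the flag is set, the weak channels are set back to one, and the limiting step then raises them to 6.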
Referring to FIG. 1, the threshold limited, modified channel SNR indices .sigma.".sub.q (i) are provided to the channel gain calculation stage 110 to determine an overall gain factor .gamma..sub.n for the current frame based upon a pre-set minimum overall gain .gamma..sub.min, a noise floor energy E.sub.floor, and the estimated noise spectrum of the previous frame E.sub.n (m-1). Channel gain .gamma..sub.db (i) (in decibels), determined with a preset gain slope .mu..sub.g and based upon the overall gain factor .gamma..sub.n, the predetermined channel SNR threshold value .sigma..sub.th and the threshold limited, modified channel SNR indices .sigma.".sub.q (i), is then converted to linear channel gains .gamma..sub.ch (i) by taking the inverse logarithm of base 10. The linear channel gains .gamma..sub.ch (i) are then applied to the transformed input signal G(k) by a gain adjuster 103 (FIG. 1) resulting in a noise-reduced signal spectrum H(k). This noise reduced signal spectrum H(k) is then converted into time domain at time domain conversion stage 111 (FIG. 1) producing a time domain noise reduced signal s'(n).
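A minimal Python sketch of the channel gain conversion is given below. The text names the inputs (gain slope .mu..sub.g, overall gain factor .gamma..sub.n, threshold .sigma..sub.th, and the limited indices) but not the formula, so the linear form used here and the division by 20 in the decibel conversion are assumptions, and the function name is illustrative. The overall gain factor is taken as an input rather than computed.

```python
def channel_gains(sigma_lim, gamma_n, mu_g, sigma_th=6):
    """Convert limited SNR indices to linear channel gains.

    Assumed form: gamma_db(i) = mu_g * (sigma_lim(i) - sigma_th) + gamma_n,
    followed by gamma_ch(i) = 10 ** (gamma_db(i) / 20).
    """
    gains = []
    for s in sigma_lim:
        gamma_db = mu_g * (s - sigma_th) + gamma_n  # gain in decibels
        gains.append(10.0 ** (gamma_db / 20.0))     # inverse log, base 10
    return gains
```

Under this assumed form, a channel whose index sits at the threshold receives exactly the overall gain factor, and channels with higher indices are attenuated less.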
It should be noted that the channel noise energy estimate E.sub.n (m) for the subsequent frame (m+1) is updated if and only if the noise frame indicator indicates a noise frame (update.sub.-- flag=True). The updating is carried out based upon a predetermined minimum allowable channel energy E.sub.min, and a channel noise smoothing factor .alpha..sub.n. Also, the channel noise energy estimate E.sub.n (m) is initialized to the channel energy E.sub.ch (m) of the first frame, that is, where m=1.
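The conditional noise estimate update can be sketched in Python as follows. The text identifies the inputs (E.sub.min and .alpha..sub.n) but not the update formula, so the smoothed, floor-limited form shown here is an assumption, as is the function name.

```python
def update_noise_estimate(e_n, e_ch, update_flag, alpha_n, e_min):
    """Return the channel noise energy estimate for frame m+1.

    Assumed form: E_n(m+1, i) = max(E_min,
        alpha_n * E_n(m, i) + (1 - alpha_n) * E_ch(m, i)),
    applied only when update_flag is True.
    """
    if not update_flag:
        return list(e_n)  # estimate is carried over unchanged
    return [max(e_min, alpha_n * n + (1.0 - alpha_n) * c)
            for n, c in zip(e_n, e_ch)]
```

Because the estimate changes only on frames flagged as noise, speech frames do not leak into the background noise estimate.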
A trade-off exists between the maximum noise reduction effect and the quality of the reconstructed speech. As in the channel energy estimator stage 102, to maintain accuracy in performing the inverse quantization to generate 64 gain values from the 16 channel gains, small step sizes are used for low frequency ranges, the step size is increased for higher frequencies, and the largest step size is used for the highest frequencies. Depending upon the result from the noise update decision stage 107, the current frequency spectrum G(k) is classified as either noise or speech. If the noise frame indicator (update.sub.-- flag) at the noise update decision stage 107 indicates a noise frame, then the current frequency spectrum G(k) is used and saved for estimating the noise characteristics of the environment in the background noise estimator stage 109.
Under ideal conditions, that is, where neither background noise nor other noise sources exist, a noise reducer is unnecessary. However, since background noise is always present, and the noise reducer is therefore always needed, it would be desirable to be able to control the gain and the frequency response of the voice signal using the already existing parameters of the noise reducer. One approach has been to modify the hardware of the front-end analog circuit. However, this requires additional components, which necessarily increase complexity and provide another potential source of noise. Therefore, it would be desirable to have a speech signal pre-processing system where the signal gain and its frequency response can be adjusted without hardware modification or added complexity.