A noise suppression apparatus for suppressing non-object signals, for example, noise superimposed on voice signals, is disclosed, for example, in Japanese Patent Application Laid-Open (JP-A) No. 8-221093. The theoretical basis of the apparatus disclosed therein is the so-called Spectral Subtraction Method (SS method), which focuses on the amplitude spectrum. This method is introduced in document 1 (Steven F. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction”, IEEE Trans. ASSP, Vol. ASSP-27, No. 2, April 1979).
The conventional noise suppression apparatus disclosed in JP-A No. 8-221093 is explained below, referring to FIG. 13. In FIG. 13, reference numeral 101 denotes a framing processing unit, 102 denotes a windowing processing unit and 103 denotes a Fast Fourier Transformation processing unit. Reference numeral 104 denotes a band dividing unit, 105 denotes a noise estimation unit, 106 denotes an NR value calculation unit, 107 denotes an Hn value calculation unit, 108 denotes a filter processing unit, 109 denotes a band conversion unit, 110 denotes a spectrum correction unit, 111 denotes an inverse Fast Fourier Transformation processing unit, 112 denotes an overlap adding unit, 113 denotes a voice signal input terminal, 114 denotes a voice signal output terminal, and 115 denotes an output signal terminal. Inside the noise estimation unit 105, reference numeral 121 denotes an RMS calculation unit, 122 denotes a relative energy calculation unit, 123 denotes a maximum RMS calculation unit, 124 denotes an estimated noise level calculation unit, 125 denotes a maximum SNR calculation unit and 126 denotes a noise spectrum estimation unit.
The operating principle of the conventional noise suppression apparatus is explained below.
An input voice signal y [t], which includes a voice signal component and a noise component, is input into the voice signal input terminal 113. The input signal y [t] is a digital signal that has been sampled at, for example, a sampling frequency FS. The signal is then sent to the framing processing unit 101, where it is divided into frames, each of which has a frame length FL. Thereafter, the signal processing is carried out frame by frame.
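The framing step can be illustrated with a minimal sketch. The function name split_into_frames and the hop parameter (the shift between successive frames) are assumptions made for illustration only; the text above specifies only the frame length FL.

```python
def split_into_frames(y, frame_length, hop):
    """Divide the sampled signal y[t] into frames of frame_length samples.

    hop is a hypothetical parameter: the number of samples by which
    successive frames are shifted. hop < frame_length gives overlapping frames.
    """
    frames = []
    for start in range(0, len(y) - frame_length + 1, hop):
        frames.append(y[start:start + frame_length])
    return frames


# Ten samples, frame length 4, shift 2: frames start at samples 0, 2, 4 and 6.
frames = split_into_frames(list(range(10)), frame_length=4, hop=2)
```

Samples at the tail that do not fill a whole frame are dropped in this sketch; a practical implementation might pad them instead.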
Prior to the calculation in the Fast Fourier Transformation processing unit 103, each framed signal yframe [j, k] sent from the framing processing unit 101 is windowed in the windowing processing unit 102. Here j denotes a sample number and k denotes a frame number.
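The windowing of one frame can be sketched as follows. The window shape is not stated above; a Hann window is assumed here purely for illustration, and the names hann_window and apply_window are hypothetical.

```python
import math


def hann_window(n):
    """Periodic Hann window of n points (an assumed window shape)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * j / n) for j in range(n)]


def apply_window(frame, window):
    """Multiply each sample yframe[j, k] of one frame by the window value at j."""
    return [s * w for s, w in zip(frame, window)]


window = hann_window(8)
windowed = apply_window([1.0] * 8, window)
```

The window tapers the frame towards zero at its edges, which reduces spectral leakage in the subsequent transformation.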
The windowed signal undergoes, for example, a 256-point Fast Fourier Transformation in the Fast Fourier Transformation processing unit 103. The obtained frequency spectrum amplitude values are divided into, for example, 18 bands in the band dividing unit 104. The band-divided input signal spectrum Y [w, k] is sent to the spectrum correction unit 110, to the noise spectrum estimation unit 126 in the noise estimation unit 105, and to the Hn value calculation unit 107. Here w denotes a band number.
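The transformation and band division can be sketched as below. The recursive radix-2 routine is a textbook FFT, not the implementation of the cited apparatus. How the spectrum amplitudes are grouped into bands is not stated above, so band_amplitudes assumes roughly equal-width bands and averages the amplitudes in each; both function names are hypothetical.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for m in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * m / n) * odd[m]
        out[m] = even[m] + t
        out[m + n // 2] = even[m] - t
    return out


def band_amplitudes(spectrum, num_bands):
    """Average the amplitudes of the lower half of the spectrum per band.

    Assumes roughly equal-width bands and num_bands <= len(spectrum) // 2.
    """
    half = len(spectrum) // 2
    amps = [abs(c) for c in spectrum[:half]]
    edges = [round(b * half / num_bands) for b in range(num_bands + 1)]
    return [sum(amps[lo:hi]) / (hi - lo) for lo, hi in zip(edges, edges[1:])]


# A 256-sample cosine with 10 cycles per frame concentrates its energy in bin 10.
frame = [math.cos(2.0 * math.pi * 10 * j / 256) for j in range(256)]
spectrum = fft(frame)
Y = band_amplitudes(spectrum, num_bands=18)
```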
Then, in the noise estimation unit 105, the framed signals yframe [j, k] are classified into voice signal frames and noise frames, so that noise-like frames are identified. At the same time, the estimated noise level value and the maximum SNR (signal-to-noise ratio) are sent to the NR value calculation unit 106.
The RMS calculation unit 121 calculates the root mean square (RMS) of each signal component in each frame, and outputs the result as an RMS value RMS [k].
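The RMS value of one frame follows directly from its definition. The sketch below is illustrative; the function name frame_rms is hypothetical.

```python
import math


def frame_rms(frame):
    """Root mean square of the samples of one frame, i.e. RMS[k]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


rms_value = frame_rms([3.0, -3.0, 3.0, -3.0])
```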
The relative energy calculation unit 122 calculates the relative energy of the k-th frame, which relates to the decay of energy from the preceding frame, and outputs the result.
The maximum RMS calculation unit 123 obtains a maximum RMS value. The maximum RMS value is needed to obtain the estimated noise level value described later and the so-called maximum SNR, which is the ratio of the signal level to the estimated noise level. The maximum RMS value is output as MaxRMS [k].
The estimated noise level calculation unit 124 selects the minimum RMS value among the RMS values of the five most recent frames (local minimum values) and outputs it as an estimated noise level value MinRMS [k]. The minimum RMS value is suitable for estimating the background noise or the background noise level.
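The selection of the minimum RMS value can be sketched as a sliding minimum. Whether the five-frame window includes the current frame is not clear from the text; the sketch assumes that it does, and the function name estimated_noise_level is hypothetical.

```python
def estimated_noise_level(rms_history, k, span=5):
    """Return MinRMS[k]: the smallest RMS value among the last `span` frames.

    Assumption: the window ends at, and includes, the current frame k.
    Fewer than `span` frames are used at the start of the signal.
    """
    start = max(0, k - span + 1)
    return min(rms_history[start:k + 1])


rms_history = [0.9, 0.2, 0.8, 0.7, 0.6, 0.5, 0.4]
min_rms = estimated_noise_level(rms_history, k=6)  # frames 2..6
```

Because voice raises the RMS value only temporarily, the minimum over a short window tends to track the background level between utterances.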
The maximum SNR calculation unit 125 calculates the maximum SNR MaxSNR [k], on the basis of the maximum RMS value MaxRMS [k] and the estimated noise level value MinRMS [k].
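The exact formula for MaxSNR [k] is not given above. The following sketch assumes the common definition of a level ratio expressed in decibels; the function name max_snr_db and the floor parameter are hypothetical.

```python
import math


def max_snr_db(max_rms, min_rms, floor=1e-12):
    """Assumed definition: 20*log10(MaxRMS[k] / MinRMS[k]) in dB.

    `floor` guards against a zero level in either argument.
    """
    return 20.0 * math.log10(max(max_rms, floor) / max(min_rms, floor))


snr = max_snr_db(1.0, 0.1)
```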
The noise spectrum estimation unit 126 calculates a time-averaged estimated value N [w, k] of the background noise spectrum, based on the RMS value RMS [k], the relative energy, the estimated noise level value MinRMS [k] and the maximum RMS value MaxRMS [k].
The NR value calculation unit 106 calculates a value NR [w, k], which is used to avoid a sudden change of the filter response.
The Hn value calculation unit 107 generates a filter Hn [w, k] for removing the noise signal from the input signal, on the basis of the band-divided input signal spectrum Y [w, k], the time-averaged estimated value N [w, k] of the noise spectrum and the output NR [w, k] of the NR value calculation unit 106. The filter Hn [w, k] generated in this unit has a response characteristic such that the noise suppression increases when the noise component is larger than the voice signal component, and decreases when the voice signal component is larger than the noise component.
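The response characteristic described here can be illustrated with a plain amplitude spectral-subtraction gain in the spirit of the SS method. This is a stand-in, not the Hn formula of JP-A No. 8-221093, which is not reproduced in this text; in particular it does not use NR [w, k]. The function name suppression_gain and the parameter min_gain are hypothetical.

```python
def suppression_gain(Y, N, min_gain=0.1):
    """Per-band gain 1 - N[w]/Y[w], limited from below by min_gain.

    Y: band amplitudes of the input spectrum for one frame.
    N: estimated noise amplitudes for the same bands.
    The gain approaches 1 when the input dominates the noise estimate and
    falls to min_gain when the noise estimate is as large as the input.
    """
    gains = []
    for y, n in zip(Y, N):
        g = 1.0 - n / y if y > 0.0 else min_gain
        gains.append(max(g, min_gain))
    return gains


# Band 0 is voice-dominated; bands 1 and 2 are noise-dominated.
gains = suppression_gain([10.0, 1.0, 0.0], [1.0, 1.0, 1.0])
```

Multiplying the input amplitude by such a gain subtracts the estimated noise amplitude from it, while the lower limit keeps the result from falling to zero or below.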
The filter processing unit 108 smooths the value of the filter Hn [w, k] along the frequency axis as well as along the time axis. The smoothing along the frequency axis is carried out by median filtering. An AP smoothing is carried out along the time axis only in voice signal sections and in noise sections; no smoothing is carried out for the signals in transient sections.
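Median filtering along the frequency axis can be sketched as follows. The filter width and the handling of the band edges are not stated above; a width of three bands with edge replication is assumed, and the function name median_smooth is hypothetical.

```python
from statistics import median


def median_smooth(h, width=3):
    """Replace each band gain by the median of its `width`-band neighbourhood.

    Assumptions: odd width; the first and last values are repeated at the edges.
    """
    half = width // 2
    padded = [h[0]] * half + list(h) + [h[-1]] * half
    return [median(padded[w:w + width]) for w in range(len(h))]


# The isolated dip in band 1 is removed; the step between bands 3 and 4 is kept.
smoothed = median_smooth([1.0, 0.1, 1.0, 1.0, 0.2, 0.2])
```

Unlike a moving average, the median suppresses isolated outliers without blurring genuine steps in the filter response.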
The band conversion unit 109 carries out an interpolation processing of the value of the band-divided filter, which is sent from the filter processing unit 108, so as to adapt it for input into the inverse Fast Fourier Transformation processing unit 111. The spectrum correction unit 110 multiplies the output of the Fast Fourier Transformation processing unit 103 by the interpolated value of the filter so that a spectrum correction processing, in other words, a noise component deduction processing, is carried out. The spectrum correction unit 110 outputs the noise-deducted signal.
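The band conversion and spectrum correction can be sketched as below. The interpolation method is not stated above; linear interpolation is assumed, and the function names interpolate_gains and correct_spectrum are hypothetical.

```python
def interpolate_gains(band_gains, num_bins):
    """Linearly interpolate per-band gains to num_bins per-bin gains.

    Assumes num_bins >= 2 and that the bands are spread evenly over the bins.
    """
    num_bands = len(band_gains)
    if num_bands == 1:
        return [band_gains[0]] * num_bins
    out = []
    for i in range(num_bins):
        pos = i * (num_bands - 1) / (num_bins - 1)  # position on the band axis
        lo = min(int(pos), num_bands - 2)
        frac = pos - lo
        out.append((1.0 - frac) * band_gains[lo] + frac * band_gains[lo + 1])
    return out


def correct_spectrum(spectrum, bin_gains):
    """Multiply each complex spectrum bin by its real gain (phase is unchanged)."""
    return [g * c for g, c in zip(bin_gains, spectrum)]


bin_gains = interpolate_gains([0.0, 1.0], num_bins=5)
corrected = correct_spectrum([2 + 2j] * 5, bin_gains)
```

Because the gain is real, only the amplitude of each bin is scaled, which matches the amplitude-spectrum focus of the SS method.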
The inverse Fast Fourier Transformation processing unit 111 carries out the inverse Fast Fourier Transformation of the noise-deducted signal obtained in the spectrum correction unit 110, and outputs the obtained signal as a signal IFFT. The overlap adding unit 112 carries out an overlap addition of the signal IFFT at the boundary portions of the frames. The obtained output voice signal is output from the voice signal output terminal 114.
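The overlap addition can be sketched as follows. The amount of overlap is not stated above, so the hop parameter (the shift between successive frames) and the function name overlap_add are assumptions made for illustration.

```python
def overlap_add(frames, hop):
    """Sum time-domain frames into one signal, shifting each frame by `hop`.

    Samples in the overlapping boundary portions of adjacent frames are added.
    """
    frame_length = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_length)
    for k, frame in enumerate(frames):
        for j, s in enumerate(frame):
            out[k * hop + j] += s
    return out


# Three frames of four samples, shifted by two samples each.
signal = overlap_add([[1.0] * 4, [1.0] * 4, [1.0] * 4], hop=2)
```

With tapered windows whose shifted copies sum to a constant, the overlapping portions add up to the original level, so frame boundaries are not audible.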
In the aforementioned noise reducing apparatus, the filter removes the noise spectrum from the input spectrum in accordance with the ratio of the estimated noise signal to the input voice signal (estimated SNR) as well as the noise signal level. The spectral suppression processing is carried out by controlling the filter characteristic according to the distribution of the voice signal and the noise signal. Distortion of the object signal is thereby kept to a minimum while a large amount of noise suppression is secured, so the aforementioned noise reducing apparatus has some excellent characteristics. However, the conventional apparatus also has the following problems.
Because the control is based on the estimated noise signal level and the estimated SNR, noise suppression cannot be carried out appropriately when the estimated noise signal level is incorrect. In such a case, signals are excessively suppressed.
In controlling the suppression amount using the estimated noise signal, the estimated noise signal is generated from the average spectrum of past frames that were identified as noise. Therefore, when the input voice signal level changes suddenly, for example, at the beginning of a word in speech, a time lag occurs in controlling the filter. As a result, the listener perceives the beginning of the word as cut off or masked, or hears an unnatural sound.