A conventional noise canceling system using a microphone array includes a microphone array having at least one microphone, a short-term analyzer connected to each microphone, an echo canceller, an adaptive beamforming processor that cancels directional noise and turns filter weight updating on or off depending on whether a front sound exists, a front sound detector that detects a front sound using a correlation between microphone signals, a post-filtering unit that cancels remaining noise depending on whether a front sound exists, and an overlap-add processor.
In a beamforming technique using a microphone array, the gain of an input signal depends on the incidence angle because of the differences between the signals arriving at the microphones. The directivity pattern therefore also depends on the angle.
FIG. 1 illustrates a graph of a directivity pattern when a microphone array is steered at an angle of 90°.
A directivity pattern is defined as in Equation 1:
D(f, a_x) = Σ_{n=−(N−1)/2}^{(N−1)/2} w_n(f) e^{j2π a_x nd}  [Eqn. 1]
where f denotes a frequency, N denotes the number of microphones, d denotes the distance between microphones, and w_n(f) = a_n(f)e^{jφ_n(f)} denotes a complex weight in which a_n(f) is an amplitude weight and φ_n(f) is a phase weight.
Therefore, in the beamforming technique, the directivity pattern generated when a microphone array is used is adjusted using a_n(f) and φ_n(f), and the microphone array is steered toward a desired direction.
It is possible to obtain only a signal of a desired direction through the above-described method.
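Equation 1 can be evaluated directly. The following is a minimal sketch in which the function name, the interpretation of the direction variable a_x, and the weight function are assumptions for illustration; with uniform unit weights and a_x = 0 (broadside), the microphones sum in phase and |D| = N.

```python
import numpy as np

def directivity(f, a_x, N, d, weights):
    """Evaluate the directivity pattern of Equation 1.

    f       : frequency in Hz
    a_x     : direction variable of Equation 1
    N       : number of microphones (odd, so indices run -(N-1)/2 .. (N-1)/2)
    d       : distance between microphones in meters
    weights : function mapping (n, f) -> complex weight w_n(f) = a_n(f) e^{j phi_n(f)}
    """
    n = np.arange(-(N - 1) // 2, (N - 1) // 2 + 1)
    w = np.array([weights(ni, f) for ni in n])
    return np.sum(w * np.exp(1j * 2 * np.pi * a_x * n * d))

# Uniform unit weights: at a_x = 0 every term is 1, so |D| = N.
uniform = lambda n, f: 1.0
D = directivity(f=1000.0, a_x=0.0, N=5, d=0.04, weights=uniform)
```

Steering and shaping the pattern then amount to choosing a_n(f) and φ_n(f) so that the weighted sum reinforces signals from the desired angle.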
Next, a Frequency Domain Blind Source Separation (FDBSS) technique is performed.
The FDBSS technique refers to a technique of separating two sound sources that are mixed with each other. The FDBSS technique is performed in the frequency domain, which simplifies the algorithm and reduces computation time.
An input signal in which two sound sources are mixed is transformed to a frequency-domain signal through a Short-Time Fourier Transform (STFT). Thereafter, it is converted into separated signals through three processes of Independent Component Analysis (ICA).
A first process is a linear transformation.
In this process, when the number of microphones is larger than the number of sound sources, the dimension of the input signal is reduced to the dimension of the sound sources through a transformation (V). Since the number of microphones is commonly larger than the number of sound sources, this dimension-reduction step is included in the ICA.
In a second process, the processed signal is multiplied by a unitary matrix (B) to compute a frequency domain value of a separated signal.
In a third process, a separation matrix (V*B) obtained through the first and second processes is processed using a learning rule obtained through research.
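The first two processes can be sketched per frequency bin as follows. This is a minimal sketch under assumptions not stated in the text: the dimension-reducing transformation V is taken to be PCA whitening onto the K dominant components (the text only states that V reduces the input dimension to the number of sound sources), the unitary matrix B for the bin is assumed given, and the learning rule of the third process is not shown.

```python
import numpy as np

def reduce_and_separate(X, B):
    """First two ICA processes of the FDBSS front end, for one frequency bin.

    X : (M, T) complex STFT frames from M microphones
    B : (K, K) unitary separation matrix for this bin (K sound sources)
    Returns the K separated frequency-domain signals and the transformation V.
    """
    K = B.shape[0]
    R = X @ X.conj().T / X.shape[1]          # spatial covariance of this bin
    eigval, eigvec = np.linalg.eigh(R)       # eigenvalues in ascending order
    idx = np.argsort(eigval)[::-1][:K]       # keep the K dominant directions
    V = np.diag(eigval[idx] ** -0.5) @ eigvec[:, idx].conj().T
    Z = V @ X                                # first process: dimension reduction
    Y = B @ Z                                # second process: unitary separation
    return Y, V
```

The combined separation matrix for the bin is then the product of B and V, which the third process refines with a learning rule.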
After obtaining the separated signal through the above-described processes, localization is performed.
Through localization, the direction from which each sound source separated by the ICA arrives is discriminated.
The next process is a permutation.
This process is performed so that each separated sound source keeps a consistent direction "as is" across frequency bins.
As a final process, scaling and smoothing are performed.
The scaling process is performed to adjust a magnitude of a signal in which sound source separation is performed so that a magnitude of the signal is not distorted.
To this end, a pseudo inverse of a separation matrix used for sound source separation is computed.
Thereafter, the frequency responses, which are sampled at L points with an interval of fs/L (fs: the sampling frequency) in the FDBSS, correspond to periodic signals with period L/fs in the time domain.
This corresponds to a periodic, infinite-length filter, which is not realizable.
For this reason, a filter in which a signal has one period in a time domain is commonly used.
However, in the case of using this filter, signal loss occurs, and separation performance deteriorates.
In order to solve the problem, a smoothing process is necessary.
In the smoothing process, the filter is multiplied by a Hanning window, whose both ends gradually and smoothly become zero (0), so that the frequency response becomes smooth. As a result, signal loss is reduced, and separation performance is improved.
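The truncation and smoothing steps above can be sketched as follows. This is a minimal sketch: obtaining the one-period filter by an inverse FFT of the L-point frequency response and centering it before windowing are illustrative choices, not details given in the text.

```python
import numpy as np

def smooth_filter(H):
    """Truncate the FDBSS frequency response to a one-period filter and
    smooth it with a Hanning window.

    H : (L,) complex frequency response sampled at L points (interval fs/L)
    Returns the frequency response of the windowed one-period filter.
    """
    h = np.fft.ifft(H)            # one period of the periodic impulse response
    h = np.fft.fftshift(h)        # center the filter before windowing
    win = np.hanning(len(h))      # both ends gradually, smoothly become zero
    return np.fft.fft(np.fft.ifftshift(h * win))
```

Multiplying by the window tapers the filter's ends to zero, which smooths the frequency response and reduces the signal loss caused by plain truncation.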
A technique of separating sound sources as described above is the FDBSS technique.
However, the conventional beamforming technique adjusts the directivity pattern of a microphone array to obtain a signal of a desired direction, and it has a problem in that performance deteriorates when a different sound source is present around the desired direction. That is, the conventional beamforming technique can roughly adjust the directivity pattern toward a desired direction, but it is difficult to make the beam sharply pointed in that direction.
The FDBSS technique has a problem in that its performance varies depending on restriction conditions such as the number of sound sources, reverberation, and shifts in the user's position. Further, when the FDBSS is used for voice recognition, missing-feature compensation is necessary.
When two persons speak at the same time and voices are mixed, voice recognition performance significantly deteriorates.
In the conventional directional noise canceling system using the microphone array, noise is estimated using the probability that a voice is present, instead of discriminating between a voice and a non-voice, under the assumption that noise has lower energy than a voice.
A noisy voice signal, which is a voice signal having a noise, is input to a microphone array 10. The noisy voice signal is transformed to a frequency-domain signal through a windowing process and the Fourier transform.
Local energy of the noisy voice signal is computed using the frequency-domain signal as in Equation 2:
S_f(k,l) = Σ_{i=−w}^{w} b(i) |Y(k−i,l)|²  [Eqn. 2]
where |Y(k,l)|² denotes the power spectrum of the input noisy voice signal, k denotes a frequency index, l denotes a frame index, and b denotes a window function of length 2w+1.
The local energy is then smoothed over time as in Equation 3:
S(k,l) = α_s S(k,l−1) + (1−α_s) S_f(k,l), where α_s (0 < α_s < 1) is a smoothing parameter  [Eqn. 3]
A minimum value of the local energy is computed as in Equation 4:
S_min(k,l) = min{S_min(k,l−1), S(k,l)}  [Eqn. 4]
A ratio between the local energy of the noisy voice and the minimum value is computed as in Equation 5:
S_r(k,l) = S(k,l)/S_min(k,l)  [Eqn. 5]
Meanwhile, a threshold value δ is set. If S_r(k,l) > δ, it is determined that a voice is present; otherwise, it is determined that a voice is not present. This can be expressed as in Equation 6:
I(k,l) = 1 if S_r(k,l) > δ, and I(k,l) = 0 otherwise  [Eqn. 6]
A probability that a voice is present is computed using the indicator for determining whether or not a voice is present, as in Equation 7:
p̂(k,l) = α_p p̂(k,l−1) + (1−α_p) I(k,l), where α_p (0 < α_p < 1) is a smoothing parameter  [Eqn. 7]
Subsequently, noise power is estimated using the probability that a voice is present, as in Equation 8:
λ̂_d(k,l+1) = λ̂_d(k,l) p̂(k,l) + [α_d λ̂_d(k,l) + (1−α_d)|Y(k,l)|²](1−p̂(k,l)) = α̃_d(k,l) λ̂_d(k,l) + [1−α̃_d(k,l)] |Y(k,l)|²  [Eqn. 8]
where α̃_d(k,l) ≡ α_d + (1−α_d) p̂(k,l) and λ̂_d denotes the estimated noise power.
As can be seen from Equation 8, when a voice is present, the previously estimated noise value is carried over as the noise power, while when a voice is not present, the previously estimated noise value and the power of the input signal are combined with weights to compute the updated noise power.
This technique of determining whether or not a voice is present in an input signal and estimating the noise in sections in which a voice is not present (i.e., noise sections) is referred to as the Minima Controlled Recursive Averaging (MCRA) technique.
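The MCRA recursion of Equations 2 through 8 can be sketched as a single per-frame update. This is a minimal sketch: the function and state layout are assumptions, the values of α_s, α_p, α_d, and δ are illustrative, and the periodic reset of the tracked minimum used in practical MCRA implementations is omitted because the text does not describe it.

```python
import numpy as np

def mcra_update(Y2, state, b, delta=5.0, alpha_s=0.8, alpha_p=0.2, alpha_d=0.95):
    """One frame of MCRA noise-power estimation (Equations 2-8).

    Y2    : (K,) power spectrum |Y(k,l)|^2 of the current frame
    state : dict with the previous 'S', 'S_min', 'p', 'lam' arrays, or None
    b     : (2w+1,) window function for the local energy of Equation 2
    """
    if state is None:
        state = {"S": Y2.copy(), "S_min": Y2.copy(),
                 "p": np.zeros_like(Y2), "lam": Y2.copy()}
    Sf = np.convolve(Y2, b, mode="same")            # Eqn. 2: local energy
    S = alpha_s * state["S"] + (1 - alpha_s) * Sf   # Eqn. 3: temporal smoothing
    S_min = np.minimum(state["S_min"], S)           # Eqn. 4: running minimum
    I = (S / np.maximum(S_min, 1e-12) > delta).astype(float)  # Eqns. 5-6
    p = alpha_p * state["p"] + (1 - alpha_p) * I    # Eqn. 7: presence probability
    a_tilde = alpha_d + (1 - alpha_d) * p           # Eqn. 8: effective smoothing
    lam = a_tilde * state["lam"] + (1 - a_tilde) * Y2
    return {"S": S, "S_min": S_min, "p": p, "lam": lam}
```

Where speech is likely (p near 1), a_tilde approaches 1 and the previous noise estimate is held; where speech is unlikely, the estimate is recursively averaged toward the input power.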
A second noise canceling technique is spectral subtraction based on minimum statistics, and noise power estimation is very important in the spectral subtraction technique.
First, an input signal is frequency-transformed and then separated into a magnitude and a phase.
Of the separated values, a phase value is maintained “as is,” and a magnitude value is used.
A magnitude value of a section in which only a noise is present is estimated and subtracted from a magnitude value of the input signal.
This value and the phase value are used to recover a signal, so that a noise-canceled signal is obtained.
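The magnitude-subtraction and recovery steps above can be sketched for one frame as follows. This is a minimal sketch assuming the noise magnitude for the frame has already been estimated (for example, from a noise-only section); the spectral floor parameter is an illustrative safeguard against negative magnitudes, not a detail given in the text.

```python
import numpy as np

def spectral_subtract(Y, noise_mag, floor=0.02):
    """Magnitude spectral subtraction for one STFT frame.

    Y         : (K,) complex spectrum of the noisy input frame
    noise_mag : (K,) estimated noise magnitude
    floor     : spectral floor keeping the subtracted magnitude non-negative
    """
    mag, phase = np.abs(Y), np.angle(Y)               # separate magnitude and phase
    clean = np.maximum(mag - noise_mag, floor * mag)  # subtract the noise magnitude
    return clean * np.exp(1j * phase)                 # recover with the original phase
```

The phase is carried through unchanged, so only the magnitude is modified before the signal is transformed back to the time domain.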
A section in which only a noise is present is estimated using a short-time sub-band power estimation of a signal having a noise.
The computed short-time sub-band power estimate has peaks and valleys, as illustrated in FIG. 2.
Since sections having peaks are recognized as speech activity sections, noise power can be computed by estimating sections having valleys.
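The valley estimation can be sketched as a sliding minimum over the short-time sub-band power of one band. The function name and window length are assumptions for illustration; tracking the minimum over recent frames picks out the valleys, which are taken as noise-only activity.

```python
import numpy as np

def subband_noise_floor(power, win=8):
    """Estimate the noise floor of one sub-band from its short-time power.

    power : (T,) short-time sub-band power values over T frames
    win   : number of past frames searched for the minimum (a valley)
    """
    floor = np.empty_like(power)
    for t in range(len(power)):
        # Minimum of the most recent `win` frames: peaks (speech) are
        # ignored and the valleys (noise) are tracked.
        floor[t] = power[max(0, t - win + 1): t + 1].min()
    return floor
```

The window must be long enough to span speech activity between valleys, or peaks would be mistaken for the noise floor.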
The technique that uses the noise estimate computed in this way to cancel noise through the spectral subtraction method is the spectral subtraction based on minimum statistics.
However, the conventional noise canceling method has a problem in that it cannot detect a change in a burst noise and so cannot appropriately reflect it in the noise estimation. That is, the conventional noise canceling method performs poorly for a noise that lasts a short time but has as much energy as a voice, such as a footstep or keyboard typing sound generated in an indoor environment.
Therefore, noise estimation is not accurate, and thus a noise remains. Such a remaining noise makes users uncomfortable in voice communications or causes a malfunction in a voice recognizer, thereby deteriorating performance of the voice recognizer.
That is, a voice and a non-voice are discriminated such that a section whose energy level or Signal-to-Noise Ratio (SNR) exceeds a threshold is recognized as a voice section, and a section with a smaller value is recognized as a non-voice section. Consequently, when an ambient noise with as high an energy level as a voice is input, noise estimation and updating are not performed, and the conventional noise canceling method performs poorly for such an ambient noise.