1. Technical Field
The present invention generally relates to a noise cancellation apparatus and method and, more particularly, to an apparatus and method that remove noise based on voice characteristics.
2. Description of the Related Art
Since the 1950s, many technologies related to voice recognition have been developed.
Recently, voice recognition has attracted attention in various application fields, driven by increases in cloud-based network processing capacity, in the capacity of the processors and memory used for voice recognition, and in the need for various user interface technologies. With greater network processing capacity and device processing ability, various element technologies can be applied, so that the voice recognition rate may be greatly improved in the processing of a natural language as well as an isolating language. As a result, voice recognition technology may be applied even to fields that require the recognition of more words and phrases, and its range of applications is expanding.
To improve the voice recognition rate, methods based on various voice recognition technologies have been presented, and a great variety of technical approaches have been taken depending on language models, voice model learning and training, and database (DB) management, as well as application fields. Further, there has been extensive research and development of technology that effectively improves the voice recognition rate (in terms of both performance improvement and complexity reduction) by suppressing or cancelling the noise that the environment in which voice (speech) is uttered adds to the voice signal. The present invention focuses on noise cancellation technology as an approach to improving the voice recognition rate.
Representative noise cancellation technology applied to voice processing (including voice recognition) includes Mel-Frequency Cepstral Coefficients-Minimum Mean Square Error (MFCC-MMSE) technology.
A device to which MFCC-MMSE noise cancellation technology is applied may include a frequency conversion unit for receiving a voice signal in a time domain and converting it into a voice signal in a frequency domain; a power calculation unit for calculating signal power in the frequency domain; a Mel-frequency filter unit for performing filtering in consideration of the frequency domain weight and nonlinearity of the voice signal; a noise cancellation unit for cancelling and suppressing a noise signal by applying an MFCC-MMSE algorithm to the voice signal; an inverse frequency conversion unit for converting the domain of the voice signal using a noise-cancelled signal; a normalization unit for normalizing the received signal by reflecting the gain thereof; and a parameter extraction unit for extracting parameters required for voice recognition using a normalized signal.
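The first three units listed above (frequency conversion, power calculation, and Mel-frequency filtering) can be illustrated with a minimal sketch. It is not the implementation of any particular device: the naive DFT, the triangular filter shape, the Mel-scale formula 2595·log10(1 + f/700), and all function and parameter names are assumptions chosen for illustration, since the text does not specify them.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT followed by |X[k]|^2 for bins k = 0..N/2 (illustrative only)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def hz_to_mel(hz):
    # Assumed Mel-scale convention; the source text does not name one.
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = (sample_rate / 2.0) / (num_bins - 1)
    bank = []
    for b in range(num_filters):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        weights = []
        for k in range(num_bins):
            f = k * bin_hz
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mel_energies(frame, num_filters, sample_rate):
    """Per-filter-bank power, i.e. the kind of input a noise cancellation unit receives."""
    power = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(power), sample_rate)
    return [sum(w * p for w, p in zip(weights, power)) for weights in bank]


# Hypothetical test frame: a 1 kHz tone sampled at 8 kHz, 64 samples long.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]
energies = mel_energies(frame, num_filters=8, sample_rate=sample_rate)
strongest_bank = energies.index(max(energies))
```

With these assumed settings, the energy of the tone concentrates in the filter bank whose passband covers 1 kHz, which is the per-bank quantity that the later noise estimation operates on.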
Here, the noise cancellation unit is indicated by reference numeral 20 in FIG. 1, and the noise cancellation unit 20 of FIG. 1 may include a parameter estimation unit 21 for receiving signals output from the respective filter banks 10a to 10n of the Mel-frequency filter unit 10 and estimating parameters based on the power (variance) of noise, phase, and voice signals; a gain estimation unit 22 for calculating an MFCC-MMSE gain using the estimated parameters; and a gain application unit 23 for receiving the output signal of the Mel-frequency filter unit 10 and the MFCC-MMSE gain estimated by the gain estimation unit 22 and then performing noise cancellation.
Meanwhile, a noise estimation procedure performed by the parameter estimation unit 21 will be described in detail with reference to the flowchart of FIG. 2.
First, the power of signals and power of noise are extracted (estimated) at step S10.
Then, whether to update noise is determined at step S12. For example, the ratio of signal power calculated in a current frame to the minimum value of signal power is calculated and is compared with a preset threshold value, and then it is determined whether to update noise, based on the results of comparison.
That is, when the ratio of signal power to the minimum value of signal power is equal to or greater than the threshold value, a current section is determined to be a section in which a voice signal is present, and previously estimated noise power is utilized without change at step S14.
In contrast, when the ratio of signal power to the minimum value of signal power is less than the threshold value, the current section is determined to be a section in which a voice signal is not present, and noise power is updated using noise power estimated in a previous frame and noise power calculated in a current frame at step S16.
By means of this scheme, noise power of the current frame is finally determined at step S18.
Here, when a procedure performed at step S12 of determining whether to update noise based on the signal power ratio is represented by an equation, it may be given by the following Equation (1):
$$\frac{|m_y(b)|_t^2}{|m_n(b)|_{\min}^2} > \vartheta \qquad (1)$$
In Equation (1), $|m_y(b)|_t^2$ denotes the signal power calculated in the current frame, and $|m_n(b)|_{\min}^2$ denotes the minimum value of signal power. $\vartheta$ denotes a threshold value and is a preset parameter.
Further, when the measured signal exceeds the minimum value by the predetermined ratio, the current section is determined to be a section in which a voice signal is present. That is, since the noise power measured in the current frame contains an estimation error, the previously estimated noise power is utilized without change. This operation is represented by the following Equation (2):
$$\sigma_n^2(b)_t = \sigma_n^2(b)_{t-1} \qquad (2)$$
Meanwhile, when the measured signal does not exceed the minimum value by the predetermined ratio, the current section is determined to be a section in which the voice signal is not present, and thus noise power is calculated using the noise power measured in the current frame and the noise power estimated in the previous frame. This operation may be represented by the following Equation (3):
$$\sigma_n^2(b)_t = \alpha\,\sigma_n^2(b)_{t-1} + (1-\alpha)\,|m_y(b)|_t^2 \qquad (3)$$
where α denotes a coefficient (forgetting factor) that weights the noise power estimated in the previous frame against the noise power calculated in the current frame, and has a value in the range [0, 1].
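The decision of Equation (1) and the two update rules of Equations (2) and (3) can be sketched for a single filter bank as follows. This is a minimal illustration under stated assumptions: the function name, the default threshold and forgetting factor, the frame powers, and the fact that the minimum power is simply passed in (the text does not say how it is tracked) are all hypothetical.

```python
def update_noise_power(signal_power, min_power, prev_noise_power,
                       threshold=2.0, alpha=0.9):
    """Return the noise power estimate of the current frame for one filter bank.

    threshold and alpha are hypothetical values, not taken from the source.
    """
    if min_power <= 0.0:
        raise ValueError("min_power must be positive")
    # Equation (1): ratio of current signal power to the minimum signal power.
    ratio = signal_power / min_power
    if ratio > threshold:
        # Voice present: Equation (2), keep the previous estimate unchanged.
        return prev_noise_power
    # Voice absent: Equation (3), first-order recursive (IIR) smoothing.
    return alpha * prev_noise_power + (1.0 - alpha) * signal_power


# Hypothetical per-frame signal power: frames 3 and 4 contain voice.
frames = [1.0, 1.2, 8.0, 9.0, 1.1]
min_power = 1.0          # assumed to be tracked elsewhere
noise = 1.0              # assumed initial noise power
history = []
for power in frames:
    noise = update_noise_power(power, min_power, noise)
    history.append(noise)
```

In this run the estimate is frozen during the two high-power frames and resumes updating once the power falls back below the threshold ratio.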
However, the noise power estimation technique of the conventional noise cancellation method estimates the noise power of the current frame using the noise power of the previous frame, so the overall noise cancellation performance depends heavily on the value chosen as the initial noise power. Therefore, a procedure of determining the initial noise power most suitable for the current environment in which voice processing is performed is required.
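The dependence on the initial value can be made explicit by unrolling Equation (3) over t consecutive frames in which no voice is present. The following is a worked derivation rather than part of the conventional method as described; the index k and the initial estimate written with subscript 0 are notation introduced here for illustration.

```latex
\[
\sigma_n^2(b)_t
  = \alpha^{t}\,\sigma_n^2(b)_0
  + (1-\alpha)\sum_{k=1}^{t} \alpha^{\,t-k}\,|m_y(b)|_k^2
\]
```

The initial estimate therefore contributes with weight α^t. As an example with an assumed forgetting factor of α = 0.98, the weight after 50 noise-only frames is still about 0.36, and frames in which voice is present (Equation (2)) do not reduce it at all, so a poorly chosen initial value can persist for a long time.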
Further, to estimate noise power, the conventional noise cancellation method utilizes an Infinite Impulse Response (IIR) filter that combines the noise power of the previous frame with the noise power calculated in the current frame in a section in which a voice signal is not present. The estimation coefficient (forgetting factor) used at this time is an experimentally determined fixed value. When a fixed forgetting factor is used, it is difficult to cope effectively with noise characteristics (noise power variation or the like) in various environments. That is, when a very large forgetting factor (≈1) is used in an environment in which noise varies very sharply, it is difficult to track the rapidly varying noise power. In contrast, when a very small forgetting factor (≈0) is used in an environment in which noise varies very slowly, the noise estimation error increases, thus negatively influencing noise cancellation performance.
Therefore, noise cancellation technology for voice processing requires a method and apparatus capable of maximizing noise cancellation performance by setting parameters such as the initial noise power value and the IIR filter coefficient to values optimized for the environment.
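Both failure modes of a fixed forgetting factor can be reproduced with a small numerical sketch of the recursive update of Equation (3). The input sequences and the two α values are hypothetical and chosen only to make the effect visible; all frames are assumed to be noise-only.

```python
def track_noise(powers, alpha, initial):
    """Apply sigma_t = alpha * sigma_(t-1) + (1 - alpha) * p_t to every frame."""
    estimate = initial
    estimates = []
    for p in powers:
        estimate = alpha * estimate + (1.0 - alpha) * p
        estimates.append(estimate)
    return estimates


def max_deviation(estimates, target):
    """Largest absolute distance between the estimates and a reference level."""
    return max(abs(e - target) for e in estimates)


# Case 1: noise power jumps sharply from 1.0 to 4.0 (hypothetical values).
step = [1.0] * 5 + [4.0] * 10
slow = track_noise(step, alpha=0.95, initial=1.0)   # large forgetting factor
fast = track_noise(step, alpha=0.2, initial=1.0)    # small forgetting factor

# Case 2: noise power is stationary around 1.0, but each frame's measurement
# fluctuates (modelled here by a deterministic alternation for repeatability).
fluctuating = [0.5, 1.5] * 10
smooth = track_noise(fluctuating, alpha=0.95, initial=1.0)
jittery = track_noise(fluctuating, alpha=0.2, initial=1.0)

lag_error = 4.0 - slow[-1]                    # large: alpha ~ 1 lags the jump
smooth_error = max_deviation(smooth, 1.0)     # small
jitter_error = max_deviation(jittery, 1.0)    # large: alpha ~ 0 follows every frame
```

In the first case the large forgetting factor is still far from the new level ten frames after the jump, while in the second case the small forgetting factor follows every per-frame fluctuation, which is the trade-off a single fixed coefficient cannot resolve for all environments.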
As related preceding technology, U.S. Patent Application Publication No. 2011-0300806 (entitled “User-Specific Noise Suppression for Voice Quality Improvements”) discloses technology in which an application device used by a single user, such as a cellular phone, improves the performance of voice recognition by performing noise suppression based on the voice features of the user.
As another related preceding technology, there is technology related to methods of estimating signal and noise levels, since estimating these levels is the most important factor in selecting noise cancellation parameters. One such method, which estimates parameters when a voice signal is not present and utilizes a fixed value when a voice signal is present, is published in a paper by Dong Yu, Li Deng, Jasha Droppo, Jian Wu, Yifan Gong, and Alex Acero, “A Minimum-Mean-Square-Error Noise Reduction Algorithm on Mel-Frequency Cepstra for Robust Speech Recognition”, ICASSP 2008, 1-4244-1484-9, pp. 4041-4044.
As further related preceding technology, technology for improving the adaptability of a Cochlear Implant (CI) to background noise, by performing noise suppression adaptively to the environment so as to prevent CI performance from being degraded in a noisy environment, is published in a paper by Vanishree Gopalakrishna, Nasser Kehtarnavaz, and Taher S. Mirzahasanloo, “Real-Time Automatic Tuning of Noise Suppression Algorithms for Cochlear Implant Applications”, IEEE Trans. on Biomedical Engineering, Vol. 00, No. 00, 2012.