1. Technical Field
The present invention relates generally to an apparatus and method for improving voice recognition and, more particularly, to an apparatus and method that are capable of improving the recognition rate in a voice recognition process.
2. Description of the Related Art
Voice recognition is problematic in that the recognition rate is reduced by surrounding noise other than voice. In general, a reduction in the word-level recognition rate of a voice recognizer may be viewed as resulting from distortion of the voice signal attributable to surrounding noise. Distortion of the voice signal reduces the voice recognition rate because it yields a value that cannot be matched to a specific state when compared with the learnt acoustics database included in the voice recognizer. This problem occurs in most voice recognizers that perform voice recognition based on a hidden Markov model (HMM) algorithm.
Voice recognizers based on an HMM algorithm extract data called a Mel-frequency cepstral coefficient (MFCC) on a specific time unit basis. The MFCC extracted on a specific time unit basis is transferred to the decoder part of the voice recognizer, and voice recognition decoding is performed according to the HMM algorithm using previously learnt acoustics and language databases.
In this case, the voice recognition rate is reduced due to a problem that occurs when an MFCC value distorted due to surrounding noise is transferred to the decoder of the voice recognizer. The voice recognition rate may be improved by appropriately removing or compensating for a noise component.
In past research, a method of removing a noise signal from a voice signal in the time or frequency domain was proposed. Research into this method has been carried out independently of the field of voice recognition.
However, this method is disadvantageous in that it may distort the voice in a way that differs from the learnt database of the voice recognizer. In general, in this method, the signal-to-noise ratio (SNR) of the voice signal to the noise is estimated, and the signal is multiplied by a corresponding gain value in the frequency or time domain. If an erroneous SNR value is estimated, the recognition rate is reduced or a sufficient noise removal effect may not be obtained. Furthermore, computational complexity increases because the influence of noise must be estimated with respect to each frequency value.
In the HMM-based voice recognizer illustrated in FIG. 1, an MFCC generation unit 110 generates an MFCC 120 based on received voice data 100. A monitoring probability calculation unit 130 and a Viterbi decoder calculation unit 140 perform sequential calculation processes on the MFCC 120, thereby obtaining a voice recognition result 150. To do so, the monitoring probability calculation unit 130 and the Viterbi decoder calculation unit 140 must receive voice recognition learning data, i.e., data from an acoustics model database 160 and a language model database 170. The monitoring probability calculation unit 130 and the Viterbi decoder calculation unit 140 may be viewed as corresponding to the decoder of the voice recognizer.
As illustrated in FIG. 2, the HMM-based voice recognizer performs a process of searching for an optimized path within a voice search network based on voice feature data (called an MFCC).
The voice recognizer may calculate the probability (monitoring probability) 200 that an input corresponds to each of the internal states 220 forming the voice search network, using a Gaussian mixture model (GMM) function and an already learnt acoustics database. The variance and probability value of each of the states 220 used for the calculation are stored in a learning database. Furthermore, transition probabilities 210 and 230 between the states 220 are stored as learning data.
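The monitoring probability calculation described above may be illustrated by the following minimal sketch, which evaluates a GMM in the log domain for a single state. The diagonal covariance, the mixture weights and means, the function names, and the two-component example model are assumptions made for illustration only and are not taken from the learning database described above.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log of sum_k w_k * N(x; mean_k, var_k), computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # subtract the peak to avoid underflow in exp()
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical two-component model for one state over a 2-dimensional feature.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

near = gmm_log_likelihood([0.1, -0.1], weights, means, variances)
far = gmm_log_likelihood([10.0, 10.0], weights, means, variances)
print(near > far)  # a feature close to a component mean scores higher
```

Working in the log domain keeps the per-frame scores additive, which is convenient when they are later accumulated along a path.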
If an MFCC is input to the voice recognizer on a time unit basis, the voice recognizer searches for an optimized path within the voice search network using the monitoring probability 200 and the transition probabilities 210 and 230. This process is the same as that performed by a Viterbi decoder. Accordingly, a Viterbi decoder is used in an HMM-based voice search process. That is, a word including a pronunciation corresponding to the optimized path becomes the voice recognition result.
As described above, when an MFCC is input on a time unit basis, the HMM-based voice recognizer determines the path along which the transition probabilities 210 and 230 and the monitoring probability 200 have a maximum cumulative value to be the optimized path. In general, there is a considerable possibility that, due to unwanted surrounding noise, the optimized path found by the search differs from the state change path corresponding to the utterance of the speaker. This corresponds to misrecognition by the voice recognizer.
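The search for the path having the maximum cumulative value may be sketched as follows. This is a generic log-domain Viterbi search, not the specific decoder of FIG. 1; the function name, the two-state left-to-right network, and all probability values are hypothetical.

```python
import math


def viterbi(log_obs, log_trans, log_init):
    """Return (best_path, best_score) for a sequence of frames.

    log_obs[t][s]   : log monitoring probability of state s at frame t
    log_trans[p][s] : log transition probability from state p to state s
    log_init[s]     : log probability of starting in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_obs)):
        prev = score
        score, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best_p = max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
            score.append(prev[best_p] + log_trans[best_p][s] + log_obs[t][s])
            ptr.append(best_p)
        backptr.append(ptr)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


NEG_INF = float("-inf")  # log of probability zero
log_init = [0.0, NEG_INF]  # always start in state 0
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_obs = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.8), math.log(0.2)],
    [math.log(0.1), math.log(0.9)],
]
path, score = viterbi(log_obs, log_trans, log_init)
print(path)  # [0, 0, 1]
```

If noise distorts the per-frame monitoring probabilities in `log_obs`, the maximizing path can change even though the transition probabilities are unchanged, which is the misrecognition mechanism described above.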
In order to solve the above problem, in a conventional voice recognizer illustrated in FIG. 3, a noise processor 310 for separating only a voice signal from a signal 300 mixed with noise or compensating for the voice signal is disposed in front of an MFCC generation unit 330.
In general, a method for processing noise in voice is used in the noise processor 310 of FIG. 3. Attempts have been made to estimate a gain value for predicting and correcting the SNR of voice and noise in the frequency domain.
In this method, as illustrated in FIG. 4, in order to transform the signal 300 mixed with noise into the frequency domain and analyze the noise signal, the signal 300 is passed through a fast Fourier transform (FFT) unit 311, thereby obtaining an output value for each frequency. The voice and noise components of each of the output values of the FFT unit 311 are then processed by an SNR estimation unit 312, a gain generation unit 313, and a noise signal compensation unit 314. Accordingly, a series of processes for improving the voice signal in the frequency domain is performed on the voice and noise signals. The improved voice signal in the frequency domain is passed through an inverse FFT unit 315, thereby obtaining a voice signal 316 from which noise has been removed.
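The chain of transform, SNR estimation, gain generation, compensation, and inverse transform may be sketched for a single frame as follows. The sketch rests on several assumptions that are not part of FIG. 4: a naive discrete Fourier transform stands in for the FFT to keep the example free of external libraries, the noise power for each frequency is taken as known in advance, the SNR is estimated by simple power subtraction, and a Wiener-style gain with a hypothetical floor `gain_floor` is used.

```python
import cmath
import math


def dft(x):
    """Naive DFT, standing in for the FFT (same result, O(n^2) cost)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[f] * cmath.exp(2j * math.pi * f * k / n) for f in range(n)).real / n
        for k in range(n)
    ]


def suppress_noise(frame, noise_power, gain_floor=0.05):
    """Transform -> per-frequency SNR estimate -> gain -> compensation -> inverse."""
    spectrum = dft(frame)
    cleaned = []
    for value, n_pow in zip(spectrum, noise_power):
        power = abs(value) ** 2
        snr = max(power - n_pow, 0.0) / n_pow      # assumed SNR estimate (n_pow > 0)
        gain = max(snr / (1.0 + snr), gain_floor)  # assumed Wiener-style gain
        cleaned.append(gain * value)               # compensation of this frequency
    return idft(cleaned)


# Hypothetical 16-sample frame holding a sinusoid, with flat assumed noise power.
frame = [math.sin(2 * math.pi * 2 * k / 16) for k in range(16)]
enhanced = suppress_noise(frame, [1.0] * 16)
```

Because a separate estimate and gain are computed for every frequency, an error in the assumed noise power changes the gain directly, and the per-frequency loop is the source of the computational cost noted above.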
Changes in the probabilistic and statistical characteristics of noise at each frequency are highly influenced by the surrounding environment at the location where a user speaks. If the change in noise factors attributable to the surrounding environment of the user is small, the step of removing noise does not need to be complicated. In particular, in an indoor environment, such as an office or a home, statistical and probabilistic changes in the frequency domain of the noise factors that obstruct voice recognition are very limited. This characteristic becomes more prominent in spoken words of relatively short duration, such as a word used to search for the title of a program.
In general, in a conventional short-time frequency domain noise analysis model, voice data in the frequency domain is obtained by repeatedly performing an FFT operation on time domain sampling data of 20 to 30 ms at intervals of about 10 ms. Voice or sound signals in the frequency domain are characterized in that they can be easily analyzed statistically and probabilistically. If changes in power at different frequencies are theoretically independent of one another and the surrounding noise tends toward white noise, the noise can be stably predicted statistically. Owing to this characteristic of sound signals, the conventional technology divides sampling data of 20 to 30 ms into frames at intervals of 10 ms, organizes changes in the power value in the frequency domain into a probability model using a Gaussian distribution, and predicts and corrects the SNR of voice and noise with respect to each frequency. As described above, although the conventional noise cancelling technology is theoretically elaborate, it requires a complicated procedure and complicated computation.
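The division of sampling data into overlapping frames may be sketched as follows. The 16 kHz sampling rate, the 25 ms frame length (one value within the 20 to 30 ms range), and the function name are assumptions chosen for illustration.

```python
def split_into_frames(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Split samples into overlapping frames of frame_ms, one every hop_ms."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop_len)
    ]


# One second of placeholder samples at an assumed 16 kHz sampling rate.
samples = list(range(16000))
frames = split_into_frames(samples, sample_rate=16000)
print(len(frames), len(frames[0]))  # 98 400
```

Each frame would then be transformed to the frequency domain and analyzed separately, so roughly one hundred per-frequency analyses are needed for every second of input under these assumptions.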
As a related art, U.S. Patent Application Publication No. 2010-0153104 entitled “Noise Suppressor for Robust Speech Recognition” discloses a technology in which SNR is predicted and compensated for using the output energy or power of a filter bank designed in consideration of the human auditory system, thereby greatly reducing complexity compared to the preceding technology.
As another related art, the paper “Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator, Y. Ephraim and D. Malah, April, 1985, IEEE TRANSACTIONS ON Acoustics, Speech, And Signal Processing, Vol. ASSP-33, No. 2” proposes an algorithm that converts a sound signal in the time domain into a signal in the frequency domain, statistically and probabilistically models changes in the power and energy of each frequency, and then removes a noise signal component.
As yet another related art, the paper “Robust Speech Recognition Using a Cepstral Minimum-Mean-Square-Error-Motivated Noise Suppressor, Dong Yu, Li Deng, Jasha Droppo, Jian Wu, Yifan Gong, Alex Acero, JULY, 2008, IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, Vol. 16, No. 5” discloses an improvement to the “log-MMSE suppressor” scheme in the frequency domain, i.e., a conventional noise cancellation technology.