1. Field of the Disclosure
The present disclosure relates to a voice recognition technique, and more particularly to a method and an apparatus for discriminating between a voice region and a non-voice region in an environment in which diverse types of noises and voices exist.
2. Description of the Prior Art
Recently, owing to the development of computers and the advancement of communication technology, diverse multimedia-related techniques have been developed, such as a technique for creating and editing various kinds of multimedia data, a technique for recognizing an image or voice from input multimedia data, a technique for efficiently compressing an image or voice, and others. In particular, the technique for detecting a voice region in a certain noise environment may be considered a platform technique that is required in diverse fields including the fields of voice recognition and voice compression. The voice region is not easy to detect because the voice tends to be mixed with various kinds of noises. Moreover, even if the voice is mixed with only one kind of noise, that noise may appear in diverse forms such as burst noise, sporadic noise, and others. Hence, it is difficult to discriminate and extract the voice region in certain environments.
Conventional techniques of discriminating between voice and non-voice have several drawbacks. First, since these techniques use the energy of a signal as a major parameter, they provide no method for discriminating the voice from sporadic noise, which, unlike burst noise, is not easily discriminated from the voice. Second, since only one noise source is assumed, it is not possible to predict the performance with respect to unpredicted noise. Third, since only information about the present frame is available, variation of the input signal over time cannot be considered.
For example, U.S. Pat. No. 6,782,363, entitled “Method and Apparatus for Performing Real-Time Endpoint Detection in Automatic Speech Recognition,” issued to Lee et al. on Aug. 24, 2004, discloses a technique of extracting a one-dimensional specific parameter from an input signal, filtering the extracted parameter to perform edge detection, and discriminating the voice region from the input signal using a finite state machine. However, this technique has a drawback in that it uses an energy-based specific parameter and thus has no countermeasure against sporadic noise, which is consequently treated as voice.
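The drawback of an energy-based parameter can be illustrated with a minimal sketch. The code below is not the technique of the cited patent; it is a generic fixed-threshold energy detector, and the function names, frame length, threshold, and synthetic signal are all assumptions chosen for illustration. It shows that a one-frame click, standing in for sporadic noise, is labeled as voice just as a sustained tone is.

```python
import math


def frame_energies(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def energy_vad(signal, frame_len=80, threshold=0.01):
    """Label a frame as voice (True) when its energy exceeds a fixed threshold.

    frame_len and threshold are hypothetical values, not taken from any reference.
    """
    return [e > threshold for e in frame_energies(signal, frame_len)]


# Synthetic input (assumed): silence, a one-frame click standing in for
# sporadic noise, silence again, then a 200 Hz tone standing in for voice.
rate = 8000
silence = [0.0] * 160
click = [0.8 if n % 2 == 0 else -0.8 for n in range(80)]
tone = [0.3 * math.sin(2 * math.pi * 200 * n / rate) for n in range(240)]
signal = silence + click + silence + tone

labels = energy_vad(signal)
print(labels)
# [False, False, True, False, False, True, True, True]
```

The third frame contains only the click, yet it is marked as voice because the decision depends on energy alone.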
U.S. Pat. No. 6,615,170, entitled “Model-Based Voice Activity Detection System and Method Using a Log-Likelihood Ratio and Pitch,” issued to Lie et al. on Sep. 2, 2003, discloses a method of training a noise model and a speech model in advance and computing the probability that input data matches each model. In addition to comparing the output for a single frame with thresholds, this method accumulates the outputs of several frames and compares the accumulated output with thresholds. However, this method has a drawback in that the performance of discriminating an unpredicted noise cannot be secured, since it creates separate models for noise and voice and has no model for the voice in a noise environment.
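The idea of comparing an accumulated multi-frame output with a threshold can be sketched as follows. This is not the method of the cited patent, which also uses pitch. It is a simplified illustration that assumes one scalar feature per frame, a single Gaussian for each of the speech and noise models, and hypothetical model parameters, window length, and threshold.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def frame_llr(x, speech, noise):
    """Log-likelihood ratio of one frame: log p(x|speech) - log p(x|noise)."""
    return gaussian_log_pdf(x, *speech) - gaussian_log_pdf(x, *noise)


def accumulated_decisions(features, speech, noise, window=3, threshold=0.0):
    """Voice decision per frame from the LLR summed over the last `window` frames."""
    llrs = [frame_llr(x, speech, noise) for x in features]
    decisions = []
    for i in range(len(llrs)):
        start = max(0, i - window + 1)
        decisions.append(sum(llrs[start:i + 1]) > threshold)
    return decisions


# Hypothetical pre-trained models given as (mean, variance).
speech_model = (5.0, 1.0)
noise_model = (0.0, 1.0)

# Assumed per-frame feature values: one isolated high frame, then a sustained run.
features = [0.0, 0.0, 5.0, 0.0, 0.0, 5.0, 5.0, 5.0]

single = [frame_llr(x, speech_model, noise_model) > 0.0 for x in features]
accumulated = accumulated_decisions(features, speech_model, noise_model)
print(single)       # [False, False, True, False, False, True, True, True]
print(accumulated)  # [False, False, False, False, False, False, True, True]
```

In this sketch the isolated high frame passes the single-frame test but not the accumulated one, at the cost of a delayed onset for the sustained run. Both models are still trained separately, so an input that matches neither model well remains unaddressed.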
Meanwhile, U.S. Pat. No. 6,778,954, entitled “Speech Enhancement Method,” issued to Kim et al. on Aug. 17, 2004, discloses a method for estimating noise and voice components in real time using a Gaussian distribution and model updating. However, since this method also uses a single noise source model, it is not suitable for an environment in which a plurality of noise sources exist, and it is greatly affected by the input energy.