1. Field of the Disclosure
The present disclosure relates generally to voice recognition technology, and more particularly, to a method and apparatus for distinguishing a voice region from a non-voice region in an environment where various types of noise and a voice are mixed together.
2. Description of the Related Art
Recently, with advances in computing and communication technology, various multimedia-related technologies have been developed, including technology for generating and editing multimedia data, technology for recognizing video and voice in input multimedia data, and technology for compressing video and voice more efficiently. Of these technologies, the technology for detecting a voice region in a noisy environment is a basic technology essential to fields such as voice recognition and voice compression. However, a voice region is not easy to detect because the voice is mixed with various types of noise, such as continuous noise and burst noise. Accordingly, in such an arbitrary environment, it is difficult both to detect a region in which a voice exists and to extract the voice.
As a result, the accurate detection of a voice region in a noisy environment plays an important role in improving voice recognition and enhancing user convenience. Technologies for distinguishing a voice region from a non-voice region and detecting the voice region can be broadly divided into a field using frame energy, as in U.S. Pat. No. 6,658,380; a field using time-axis filtering, as in U.S. Pat. No. 6,782,363 (hereinafter referred to as “patent '363”); a field using frequency filtering, as in U.S. Pat. No. 6,574,592 (hereinafter referred to as “patent '592”); and a field using the linear transformation of frequency information, as in U.S. Pat. No. 6,778,954 (hereinafter referred to as “patent '954”).
Like patent '954, the present invention pertains to the field using the linear transformation of frequency information, but it differs from patent '954 in that it uses a rule-based approach rather than a probabilistic model.
Patent '363 uses energy-based one-dimensional feature parameters, calculates voice region detection parameters by filtering those feature parameters, and has a filter for edge detection. Furthermore, patent '363 is configured to detect a voice region using a finite state machine. The technology disclosed in patent '363 is advantageous in that it requires only a small amount of calculation and detects end points regardless of noise level, but is problematic in that it offers no solution for burst noise because it relies on energy-based one-dimensional feature parameters.
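The general idea of energy-based end point detection with an edge filter and a finite state machine can be sketched as follows. This is a minimal illustration only, not the filter or state machine actually disclosed in patent '363: the function names, the simple difference filter, the two-state machine, and the `rise` and `fall` thresholds are all assumptions made for this example.

```python
import math


def frame_log_energy(samples, frame_len):
    """Return the log energy of each non-overlapping frame (assumed framing)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        energies.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return energies


def edge_filter(values, half_width=2):
    """Hypothetical edge filter: mean of following frames minus mean of preceding frames."""
    n = len(values)
    out = []
    for i in range(n):
        ahead = [values[min(n - 1, i + k)] for k in range(1, half_width + 1)]
        behind = [values[max(0, i - k)] for k in range(1, half_width + 1)]
        out.append(sum(ahead) / half_width - sum(behind) / half_width)
    return out


def detect_regions(edges, rise=1.0, fall=-1.0):
    """Two-state machine: enter VOICE on a rising edge, leave on a falling edge."""
    state = "SILENCE"
    start = None
    regions = []
    for i, e in enumerate(edges):
        if state == "SILENCE" and e > rise:
            state, start = "VOICE", i
        elif state == "VOICE" and e < fall:
            regions.append((start, i))
            state = "SILENCE"
    if state == "VOICE":  # signal ended while still inside a voice region
        regions.append((start, len(edges)))
    return regions
```

Because the only feature in this sketch is one-dimensional frame energy, a single loud frame surrounded by silence produces a rising and then a falling edge and is reported as a region, which is consistent with the burst-noise weakness described above.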
Furthermore, patent '592 discloses a technology for detecting voices using the energy of an output signal that has passed through a band pass filter adjusted to the voice frequency band. In this process, both length and size information are used. Patent '592 is advantageous in that a voice region can be detected using a relatively small amount of calculation. However, it is problematic in that it cannot detect a voice signal having low energy or the low-energy start portion of a consonant in the voice signal. In addition, the threshold value is difficult to determine, and variation in the threshold value affects detection performance.
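The threshold sensitivity of band-limited energy detection can be illustrated with a short sketch. It is not the filter of patent '592: the band edges of 300 Hz to 3400 Hz, the use of a direct DFT in place of a band pass filter, and the names `band_energy` and `is_voice` are assumptions chosen for illustration.

```python
import cmath
import math


def band_energy(frame, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Mean power of the frame inside an assumed voice band, via a direct DFT."""
    n = len(frame)
    energy = 0.0
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if low_hz <= freq <= high_hz:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)
            )
            # Scale so that a sine of amplitude A centered on a bin yields A^2 / 2.
            energy += 2.0 * abs(coeff) ** 2 / (n * n)
    return energy


def is_voice(frame, sample_rate, threshold):
    """Declare voice when the in-band energy exceeds a fixed threshold."""
    return band_energy(frame, sample_rate) > threshold
```

With a 64-sample frame at 8000 Hz, a 1000 Hz tone of amplitude 1.0 has an in-band energy of about 0.5, while the same tone at amplitude 0.05 has about 0.00125. A threshold of 0.01 therefore accepts the first and rejects the second, whereas a threshold of 0.001 accepts both, so a low-energy signal is kept or lost depending entirely on where the threshold is set.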
Meanwhile, patent '954 discloses a technology for modeling noise and voices in real time using a Gaussian distribution, updating the models by estimating voices and noise even when the two are mixed with each other, and removing noise based on a Signal-to-Noise Ratio (SNR) estimated through the modeling. However, because patent '954 uses single noise source models, it is considerably affected by input energy.
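A rough sketch of Gaussian modeling with online updates is given below. It is a simplified illustration and not the method of patent '954: it assumes one one-dimensional Gaussian each for noise and voice over frame energy in dB, a hard maximum-likelihood decision per frame, and an exponential-forgetting update. The class `Gaussian1D`, the function `track`, and the `rate` and `min_var` parameters are hypothetical.

```python
import math


class Gaussian1D:
    """One-dimensional Gaussian with an exponential-forgetting update (assumed form)."""

    def __init__(self, mean, var):
        self.mean = mean
        self.var = var

    def log_likelihood(self, x):
        return -0.5 * (
            math.log(2 * math.pi * self.var) + (x - self.mean) ** 2 / self.var
        )

    def update(self, x, rate=0.1, min_var=1e-3):
        delta = x - self.mean
        self.mean += rate * delta
        # Variance is floored so the model never collapses to a point.
        self.var = max(min_var, (1 - rate) * self.var + rate * delta * delta)


def track(energies_db, noise, voice):
    """Label each frame as voice (True) or noise (False) and update the winning model.

    Returns the labels and an SNR estimate in dB taken as the gap between model means.
    """
    labels = []
    for x in energies_db:
        if voice.log_likelihood(x) > noise.log_likelihood(x):
            voice.update(x)
            labels.append(True)
        else:
            noise.update(x)
            labels.append(False)
    return labels, voice.mean - noise.mean
```

In this sketch, models initialized near 20 dB for noise and 50 dB for voice separate frames at those levels, but a noise frame at 40 dB lies closer to the voice model and is labeled as voice. This is one way in which a model with a single noise component can depend on input energy, under the assumptions stated here.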
The problems of the conventional technologies are summarized as follows. First, a parameter value varies depending on the amount of noise. Second, a threshold value must be varied according to the energy of a noise signal.