A robot or a system such as a home automation system typically needs to analyze a signal input during operation and act in response to a person's command. To this end, speaker recognition or voice recognition may be performed by determining whether a person's voice is included in a signal continuously input to a microphone.
In general, voice recognition is performed by determining the similarity between a reference pattern and a voice pattern to be recognized.
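The similarity comparison described above can be illustrated with a minimal sketch. The text does not say which similarity measure is used, so the sketch assumes dynamic time warping (DTW), a classic choice for template matching. The names `dtw_distance`, `recognize`, and `references`, as well as the one-dimensional toy feature vectors, are hypothetical and chosen only for illustration.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean distance between frames
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(pattern, references):
    """Return the label of the reference pattern most similar to `pattern`."""
    return min(references, key=lambda word: dtw_distance(pattern, references[word]))

# Hypothetical reference patterns (assumption: one feature value per frame).
references = {
    "yes": [(0.0,), (1.0,), (2.0,)],
    "no": [(2.0,), (1.0,), (0.0,)],
}
# A slower utterance of the "yes" pattern: the same shape, stretched in time.
pattern = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,)]
best = recognize(pattern, references)
```

Here `best` is the label whose reference pattern has the smallest warped distance to the input. DTW is used in this sketch because it tolerates differences in speaking rate between the reference and the input.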
FIG. 1 shows an existing speech recognition system.
The speech recognition system of FIG. 1 includes a voice section detection unit (1) that detects a voice section from an input signal, a characteristic coefficient extraction unit (2) that extracts characteristics from the voice section detected by the voice section detection unit (1) using mel-frequency cepstral coefficients (MFCCs), a speech recognition unit (3) that recognizes a voice signal using a hidden Markov model (HMM) algorithm and variable multi-section vector quantization (VMS VQ), a database (4) that stores word model parameters learned from voice signals, and a post-process unit (5) that determines the validity of the voice signal recognized by the speech recognition unit (3) and outputs the recognized word.
In a speech recognition system with such a configuration, the pre-processing part detects the voice section from the input signal. Detecting the voice section accurately is a very significant operation because it is a precondition that determines the performance of the system.
Various techniques have been used to detect a voice section, which is essential in voice recognition. The most frequently used method detects a voice section using the characteristics of the voice signal on the time axis. Specifically, within a voice section the voice signal has high energy, has very high similarity between voice samples, and lasts for at least a minimum voice sustaining time. The voice section is detected by distinguishing it from background noise using these time-axis characteristics.
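The time-axis method described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the detection algorithm of any particular system. It uses only two of the three characteristics named above, frame energy and a minimum sustaining time, and omits the similarity between samples. The function names, the 8 kHz sample rate, the 160-sample frame length, the energy threshold, and the three-frame minimum duration are all hypothetical values chosen for the example.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def detect_voice_sections(samples, frame_len=160, energy_threshold=0.01, min_frames=3):
    """Return (start_sample, end_sample) spans whose frame energy stays above
    the threshold for at least `min_frames` consecutive frames."""
    energies = frame_energies(samples, frame_len)
    sections = []
    start = None  # index of the first frame of the current high-energy run
    for idx, energy in enumerate(energies):
        if energy > energy_threshold:
            if start is None:
                start = idx
        elif start is not None:
            if idx - start >= min_frames:  # enforce the minimum sustaining time
                sections.append((start * frame_len, idx * frame_len))
            start = None
    if start is not None and len(energies) - start >= min_frames:
        sections.append((start * frame_len, len(energies) * frame_len))
    return sections

# Synthetic input (assumption: 8 kHz sampling): silence, a 200 Hz tone standing
# in for voice, silence, a one-frame click, and silence again.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(1600)]
signal = [0.0] * 800 + tone + [0.0] * 800 + [0.5] * 160 + [0.0] * 480
sections = detect_voice_sections(signal)
```

In this example the sustained tone is reported as a voice section, while the one-frame click is rejected because it is shorter than the minimum duration even though its energy is high.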
However, when heavy ambient noise is present in the voice signal, the characteristics of the voice signal are corrupted by the noise, so that it is difficult to detect the voice section. For example, when the signal-to-noise ratio (SNR) is 0 dB, the signal and the noise have the same energy, so the noise and the voice section cannot be distinguished from each other by energy.
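The 0 dB case follows directly from the standard definition of SNR in decibels, ten times the base-10 logarithm of the ratio of signal power to noise power. The short sketch below illustrates this. The helper names `mean_power` and `snr_db` and the sample values are hypothetical.

```python
import math

def mean_power(samples):
    """Average power (mean squared amplitude) of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))

voice = [0.5, -0.5, 0.5, -0.5]   # toy "voice" samples (assumption)
noise = [0.5, 0.5, -0.5, -0.5]   # toy noise with the same mean power
print(snr_db(voice, noise))      # 0.0
```

Because both sequences have the same mean power, their ratio is 1 and the SNR is 0 dB. An energy threshold alone therefore cannot separate the two.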
When speaker recognition or voice recognition is performed on every signal input to the system, a correct result may not be output and power may be consumed unnecessarily. The system needs to extract only the voice generated from a desired position, ignoring both voice generated from an undesired position and non-voice noise generated from the desired position. When voice generated from a different position and undesired noise (including undesired voice) are input simultaneously, existing voice section detection algorithms perform very poorly. Furthermore, when the speaker speaks from a desired position while looking in a different direction, speech recognition is performed without distinguishing this case, even though it is highly likely that the utterance is not a voice intended for the system.