I. Field
Apparatuses and methods consistent with the present disclosure relate to an electronic device and method capable of voice recognition, and more particularly, to an electronic device and method capable of detecting a voice section from an audio signal.
II. Description of the Related Art
Techniques for controlling various electronic devices using voice signals are widely used. In general, voice recognition refers to a technique in which, when a voice signal is input into a software device, a hardware device, or a system, the intention of the user's uttered voice is identified from the input voice signal and an operation is performed accordingly.
However, such a technique may recognize not only the voice signal of the user's uttered voice but also various other sounds generated in the surrounding environment, and thus the operation intended by the user may not be performed properly.
Therefore, various voice section detection algorithms are being developed that detect, from an input audio signal, only the voice section corresponding to the uttered voice of a user.
General voice section detection methods include a method that detects a voice section using the energy of the audio signal in each frame, a method that detects a voice section using the zero crossing ratio of the audio signal in each frame, and a method that extracts a feature vector from the audio signal in each frame and then determines, using a support vector machine (SVM), whether the audio signal of that frame is a voice signal based on the extracted feature vector.
The method of detecting a voice section using the energy or the zero crossing ratio relies only on the energy or the zero crossing ratio computed for the audio signal of each frame. Such a conventional voice section detection method therefore requires relatively little computation to determine whether the audio signal of each frame is a voice signal, but it is prone to error because a voice section may be detected not only for a voice signal but also for a noise signal.
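As an illustration only, the energy and zero crossing ratio checks described above can be sketched in Python as follows. The function names, the frame length of 160 samples, the threshold values, and the decision to combine both measures in one test are assumptions made for this sketch and are not taken from the disclosure. The synthetic 200 Hz tone merely stands in for voiced speech, and the 100 Hz hum stands in for a noise source.

```python
import math

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_ratio(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def detect_voice_frames(samples, frame_len=160, energy_thresh=0.01, zcr_thresh=0.25):
    """Label each non-overlapping frame True (voice) or False (non-voice).

    The thresholds are hypothetical values chosen for the synthetic data below.
    """
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        is_voice = (frame_energy(frame) > energy_thresh
                    and zero_crossing_ratio(frame) < zcr_thresh)
        labels.append(is_voice)
    return labels

# Synthetic test signal at an assumed 8 kHz sample rate.
RATE = 8000
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(160)]  # stand-in for voice
hum = [0.5 * math.sin(2 * math.pi * 100 * n / RATE) for n in range(160)]   # stand-in for noise

labels = detect_voice_frames(silence + tone + hum)
print(labels)
```

Each frame costs only a few arithmetic operations per sample, which matches the low computational load noted above. The hum frame passes both threshold tests just as the tone frame does, which is the kind of false detection on a noise signal that this class of method is prone to.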
Meanwhile, the method of detecting a voice section using a feature vector extracted from the audio signal of each frame and an SVM detects only the voice signal from the audio signal of each frame more precisely than the aforementioned method using the energy or the zero crossing ratio. However, because determining whether the audio signal is a voice signal requires a large amount of computation, this method consumes more CPU resources than the other voice section detection methods.
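To make the computational cost concrete, the following is a minimal Python sketch of a kernel SVM decision function applied to one frame's feature vector. The two-element feature vector (energy, zero crossing ratio), the RBF kernel, and the hand-picked support vectors, coefficients, and gamma value are all hypothetical and are not trained on real data or taken from the disclosure.

```python
import math

def rbf_kernel(u, v, gamma):
    """Radial basis function kernel: exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_decision(x, support_vectors, dual_coefs, bias, gamma):
    """Kernel SVM decision value f(x) = sum_i coef_i * K(sv_i, x) + bias.

    Each coef_i is the product of a Lagrange multiplier and a class label.
    One kernel evaluation is needed per support vector, per frame.
    """
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias

# Hypothetical hand-picked model; features are [energy, zero crossing ratio].
SUPPORT_VECTORS = [[0.12, 0.05], [0.00, 0.00], [0.10, 0.50]]
DUAL_COEFS = [1.0, -0.5, -0.5]
BIAS = 0.0
GAMMA = 100.0

def is_voice(features):
    """Classify one frame's feature vector as voice (True) or non-voice (False)."""
    return svm_decision(features, SUPPORT_VECTORS, DUAL_COEFS, BIAS, GAMMA) > 0

print(is_voice([0.12, 0.05]))  # feature vector near the positive support vector
print(is_voice([0.00, 0.00]))  # silent frame
print(is_voice([0.10, 0.50]))  # frame with many zero crossings
```

In this sketch the per-frame cost grows with both the feature dimension and the number of support vectors, in addition to the cost of extracting the features themselves. A practical model would typically use far more support vectors and a much larger feature vector than the three and two shown here.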