1. Field
Apparatuses and methods consistent with the present disclosure relate to an electronic device and a method for recognizing speech, and more particularly, to an electronic device and a method for detecting a speech section in an audio signal.
2. Description of the Related Art
Speech recognition technology, which controls various kinds of electronic devices using a speech signal, has been widely used. Generally, speech recognition technology refers to technology that understands the intention of a user's uttered speech from a speech signal input from a hardware or software device or a system, and performs an operation based on the understood intention.
However, speech recognition technology recognizes not only the speech signal of the user's uttered speech but also various sounds generated in the surrounding environment, and therefore may not correctly perform the operation intended by the user.
Therefore, various speech section detection algorithms have been developed for detecting, in an input audio signal, only the speech section corresponding to the uttered speech of a user.
General methods for detecting a speech section include a method that uses the energy of the audio signal in each frame unit, a method that uses the zero crossing of the audio signal in each frame unit, and a method that extracts a feature vector from the audio signal in each frame unit and determines the presence or absence of a speech signal from the extracted feature vector using a support vector machine (SVM).
The methods that detect a speech section using energy or zero crossing compute only the energy or the zero crossing of the audio signal in each frame. As a result, these methods require relatively less computation than other speech section detection methods to determine whether the audio signal of each frame is a speech signal, but they often erroneously detect a noise signal, in addition to the speech signal, as the speech section.
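The per-frame energy and zero crossing computations described above can be sketched as follows. This is a minimal illustration rather than any particular product's algorithm: the frame length of 160 samples, the energy threshold of 0.01, the 8 kHz sampling rate, and all function names are assumptions chosen for the example.

```python
import math

def split_frames(samples, frame_len):
    """Split a sample sequence into consecutive non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def detect_by_energy(samples, frame_len=160, energy_threshold=0.01):
    """Mark a frame as speech (True) when its energy exceeds the threshold."""
    return [frame_energy(f) > energy_threshold
            for f in split_frames(samples, frame_len)]

# Hypothetical test signal at an assumed 8 kHz sampling rate:
# one silent frame, one 200 Hz tone frame (a stand-in for voiced speech),
# and one frame of alternating-sign samples (a stand-in for noise).
SAMPLE_RATE = 8000
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(160)]
noise = [0.5 if n % 2 == 0 else -0.5 for n in range(160)]

decisions = detect_by_energy(silence + tone + noise)
print(decisions)  # [False, True, True]
```

Each frame needs only a single pass over its samples, which is why the computation is small. The noise stand-in is nevertheless marked as speech because its energy exceeds the threshold, which illustrates the kind of false detection described above.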
Meanwhile, the method that detects a speech section using a feature vector extracted from the audio signal in each frame unit and an SVM detects only the speech signal among the per-frame audio signals more accurately than the foregoing methods using energy or zero crossing. However, it requires more computation to determine the presence or absence of the speech signal in the audio signal of each frame, and therefore may consume far more CPU resources than other speech section detection methods.
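The additional computation can be seen in a sketch of a kernel SVM decision function, which evaluates one kernel per support vector for every frame. The source does not specify the kernel, the features, or the trained model, so everything here is an assumption made for illustration: the RBF kernel, the two-dimensional feature vector (frame energy, zero crossing rate), and the hypothetical support vectors, coefficients, bias, and gamma in `TOY_MODEL`.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def svm_decision(x, support_vectors, dual_coefs, bias, gamma):
    """Decision value f(x) = sum_i coef_i * K(sv_i, x) + bias.

    Each coef_i is the product of a dual weight and a class label (+1 or -1).
    The cost grows with (number of support vectors) * (feature dimension).
    """
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias

def is_speech(feature, model):
    """Classify one frame's feature vector; a positive value means speech."""
    return svm_decision(feature, **model) > 0.0

# Hypothetical, hand-picked model (not trained on real data).
# Features are (frame energy, zero crossing rate).
TOY_MODEL = {
    "support_vectors": [(0.125, 0.05), (0.25, 1.0), (0.0, 0.0)],
    "dual_coefs": [1.0, -1.0, -1.0],  # +: speech-like, -: noise or silence
    "bias": 0.0,
    "gamma": 50.0,
}

print(is_speech((0.12, 0.06), TOY_MODEL))  # speech-like frame
print(is_speech((0.25, 0.98), TOY_MODEL))  # loud, noise-like frame
print(is_speech((0.0, 0.0), TOY_MODEL))    # silent frame
```

In this toy model the loud, noise-like frame is rejected even though its energy is high, because the decision depends on the whole feature vector. A practical model would typically use a higher-dimensional feature vector and many more support vectors, and the per-frame cost would grow accordingly.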