1. Field of the Invention
The present invention relates to a speech detection method, and more particularly to a speech distinction method that effectively determines speech and non-speech (e.g., noise) sections in an input voice signal including both speech and noise data.
2. Description of the Background Art
A previous study indicates that a typical phone conversation between two people consists of about 40% speech and 60% silence. During the silence periods, noise data is transmitted. This noise data may be coded at a lower bit rate than speech data using Comfort Noise Generation (CNG) techniques. Coding an input voice signal (which includes noise and speech data) at different coding rates is referred to as variable-rate coding, and variable-rate speech coding is commonly used in wireless telephone communications. To perform variable-rate speech coding effectively, the speech sections and noise sections of the signal are determined using a voice activity detector (VAD).
In the G.729 standard released by the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T), parameters of the input signal such as the line spectral frequencies (LSF), the full-band energy (Ef), the low-band energy (El), and the zero crossing rate (ZC) are obtained. A spectral distortion (ΔS) of the signal is also obtained. The obtained values are then compared with specific constants that were previously determined from experimental results to decide whether a particular section of the input signal is a speech section or a noise section.
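The fixed-threshold comparison described above can be illustrated with a minimal Python sketch. It is not the G.729 decision logic: it computes only two of the named parameters (full-band energy and zero crossing rate) for one frame, and the function names and both threshold constants are hypothetical values chosen for illustration rather than values taken from the standard.

```python
import math

# Hypothetical constants standing in for the experimentally determined
# values mentioned in the text; they are NOT the values used by G.729.
ENERGY_THRESHOLD_DB = -30.0
ZC_THRESHOLD = 0.3


def full_band_energy_db(frame):
    """Mean power of the frame in dB (a small floor avoids log of zero)."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(power, 1e-12))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def is_speech(frame):
    """Label a frame as speech when it is loud and has few zero crossings."""
    return (
        full_band_energy_db(frame) > ENERGY_THRESHOLD_DB
        and zero_crossing_rate(frame) < ZC_THRESHOLD
    )
```

Because both thresholds are fixed in advance, the same decision boundary is applied to every speaker and every noise environment, which is the limitation this background section goes on to discuss.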
In addition, in a GSM (Global System for Mobile Communications) network, when a voice signal including noise and speech is input, a noise spectrum is estimated, a noise suppression filter is constructed using the estimated spectrum, and the input voice signal is passed through the noise suppression filter. The energy of the filtered signal is then calculated and compared to a preset threshold to determine whether a particular section is a speech section or a noise section.
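The sequence of steps above (estimate a noise spectrum, build a suppression filter, filter the input, compare the remaining energy to a preset threshold) can be sketched as follows. The filter design is not given in the text, so this sketch swaps in a simple per-bin spectral-subtraction gain purely for illustration; it should not be read as the filter used in GSM. All function names and the threshold value are hypothetical.

```python
import cmath
import math

# Hypothetical preset threshold on mean power per sample (assumption).
ENERGY_THRESHOLD = 1e-3


def power_spectrum(frame):
    """Naive DFT power spectrum |X[k]|^2 (O(N^2), adequate for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(frame)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def estimate_noise_spectrum(noise_frames):
    """Average the power spectra of frames assumed to contain only noise."""
    spectra = [power_spectrum(f) for f in noise_frames]
    return [sum(bins) / len(spectra) for bins in zip(*spectra)]


def suppressed_energy(frame, noise_ps):
    """Mean power per sample left after the per-bin suppression gain."""
    total = 0.0
    for p, n in zip(power_spectrum(frame), noise_ps):
        # Spectral-subtraction style gain, clamped to the range [0, 1].
        gain = max(1.0 - n / p, 0.0) if p > 0.0 else 0.0
        total += (gain ** 2) * p
    # Parseval: sum of |x|^2 equals (1/N) * sum of |X|^2.
    return total / len(frame) ** 2


def is_speech_frame(frame, noise_ps, threshold=ENERGY_THRESHOLD):
    """Compare the noise-suppressed energy to the preset threshold."""
    return suppressed_energy(frame, noise_ps) > threshold
```

As with the previous approach, the final decision still rests on a threshold fixed ahead of time, so it does not adapt to the speaker.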
The above-noted methods require a variety of different parameters, and determine whether a particular section of the input signal is a speech section or a noise section based on previously determined empirical data, namely, past data. However, the characteristics of speech differ considerably from person to person. For example, they vary with the speaker's age and with whether the speaker is male or female. Thus, because the VAD relies on previously determined empirical data, it does not provide optimum speech analysis performance.
Another speech analysis method, intended to improve on the empirical method, uses probability theory to determine whether a particular section of an input signal is a speech section. However, this method is also disadvantageous because it does not consider the different characteristics of noise, whose spectrum varies from one conversation to another.