In the field of voice signal processing, technologies for detecting voice activity are widely used. Such a technology is called voice activity detection (VAD) in voice coding, speech endpoint detection in speech recognition, and speech pause detection in speech enhancement. These technologies emphasize different aspects in different scenarios, and thus produce different processing results. In essence, however, they all detect whether speech is present, either during voice communication or in a corpus. The detection accuracy directly affects the quality of subsequent processing (for example, voice coding, speech recognition, and speech enhancement).
Voice coding technology can reduce the transmission bandwidth of voice signals and increase the capacity of a communication system. In a voice communication, only about 40% of the time involves voice signals, and the rest involves silence or background noise. Thus, to save transmission bandwidth, VAD may be used to distinguish background noise from non-noise signals, so that the encoder can encode background noise and non-noise signals at different rates, thereby reducing the mean bit rate. In recent years, all the voice coding standards formulated by major organizations and institutions have covered specific applications of the VAD technology.
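The bit-rate saving described above can be illustrated with a short calculation. The following is a minimal sketch only: the function name and the two rates are hypothetical values chosen for illustration and are not taken from any coding standard.

```python
def mean_bit_rate(active_fraction, active_rate_kbps, noise_rate_kbps):
    """Average bit rate when non-noise and noise frames use different rates.

    active_fraction is the share of time classified as non-noise (0..1).
    """
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must lie in [0, 1]")
    noise_fraction = 1.0 - active_fraction
    return active_fraction * active_rate_kbps + noise_fraction * noise_rate_kbps


# Hypothetical rates, for illustration only: 12 kbit/s for non-noise
# frames and 2 kbit/s for background-noise frames, with 40% activity.
with_vad = mean_bit_rate(0.4, 12.0, 2.0)
without_vad = mean_bit_rate(1.0, 12.0, 2.0)  # every frame coded at the full rate
print(f"mean rate with VAD: {with_vad:.1f} kbit/s, without: {without_vad:.1f} kbit/s")
```

Under these assumed rates, the mean bit rate falls to roughly half of the full rate, because the majority of frames are coded at the lower noise rate.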
In the conventional art, VAD algorithms such as VAD1 and VAD2 used in the adaptive multi-rate (AMR) speech codec judge whether a current signal frame is a noise frame according to the signal-to-noise ratio (SNR) of an input signal. The VAD calculates an estimate of the background noise energy, and compares the ratio of the energy of the current signal frame to the energy of the background noise (that is, the SNR) with a preset threshold. When the SNR is greater than the threshold, the VAD determines that the current signal frame is a non-noise frame; otherwise, the VAD determines that the current signal frame is a noise frame. The VAD classification result is used to guide discontinuous transmission/comfort noise generation (DTX/CNG) in the encoder. The purpose of DTX/CNG is to code and transmit the noise sequence only discontinuously while the input signal is in a noise period. The noise frames that are not coded and transmitted are interpolated at the decoder, which saves bandwidth.
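The SNR-threshold decision described above can be sketched as follows. This is a simplified, full-band illustration and not the AMR VAD1 or VAD2 algorithm; the function names, the 6 dB threshold, and the smoothing factor `alpha` are assumptions introduced for the example.

```python
import math


def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def is_non_noise(frame, noise_energy, snr_threshold_db=6.0):
    """Return True if the frame SNR exceeds the preset threshold."""
    eps = 1e-12  # small floor to avoid log(0) and division by zero
    snr_db = 10.0 * math.log10((frame_energy(frame) + eps) / (noise_energy + eps))
    return snr_db > snr_threshold_db


def run_vad(frames, initial_noise_energy, snr_threshold_db=6.0, alpha=0.95):
    """Classify each frame; update the noise estimate on noise frames only.

    The noise estimate is a slow moving average (assumed form), so it
    tracks the long-term background level rather than rapid changes.
    """
    noise_energy = initial_noise_energy
    decisions = []
    for frame in frames:
        active = is_non_noise(frame, noise_energy, snr_threshold_db)
        decisions.append(active)
        if not active:
            noise_energy = alpha * noise_energy + (1.0 - alpha) * frame_energy(frame)
    return decisions


# Synthetic frames for illustration: quiet, loud, quiet.
quiet = [0.01] * 160
loud = [0.5] * 160
print(run_vad([quiet, loud, quiet], initial_noise_energy=frame_energy(quiet)))
```

In this sketch, a True decision corresponds to a non-noise frame that would be coded normally, and a False decision corresponds to a noise frame that would be handed to DTX/CNG.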
During the implementation of the present invention, the inventor finds the following problem in the conventional art: the VAD algorithm in the conventional art adapts only to the moving average of the long-term background noise level, and does not adapt to variations in the background noise. Thus, its adaptability is limited.