In a voice communication system, Voice Activity Detection (VAD) technology identifies the periods during which a voice is activated, so that signals are transmitted only while the voice is in an activated state, thus effectively saving bandwidth resources. In addition, a voice signal input by a speaker to a terminal usually includes background noise. Noise Suppression (NS) technology effectively reduces or suppresses the background noise included in the voice, thus significantly improving the experience of a listener.
In VAD, determining whether a current signal is voice essentially depends on whether the features of the current signal are closer to the features of background noise or to the features of a voice; the current signal is assigned to whichever of the two has the closer features. In NS, in order to reduce the effect that background noise imposes on a voice, some features of the current background noise must also be known, so that these features can be removed from the voice signal, thus suppressing the noise. Both VAD and NS therefore involve a key technology, namely background noise tracking.
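The closer-feature decision described above can be illustrated with a minimal sketch. The sketch assumes that the features of a frame are represented as a numeric vector, that closeness is measured by a Euclidean distance, and that a tie is resolved as noise; the function name `classify_frame` and the reference vectors are hypothetical and are not taken from any particular VAD.

```python
import math


def classify_frame(features, noise_ref, voice_ref):
    """Assign a frame to the class whose reference features are closer.

    All three arguments are equal-length sequences of numbers. Euclidean
    distance is an assumption of this sketch, not a requirement of VAD.
    """
    distance_to_noise = math.dist(features, noise_ref)
    distance_to_voice = math.dist(features, voice_ref)
    # A tie is resolved as noise here; this choice is also an assumption.
    return "voice" if distance_to_voice < distance_to_noise else "noise"
```

For example, with a noise reference of `[0.1, 0.1]` and a voice reference of `[1.0, 1.0]`, a frame whose features are `[0.9, 0.8]` is assigned to voice. The sketch also indicates why background noise tracking matters: the decision is only as reliable as the noise reference, which must follow the actual background noise.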
Currently, a widely used background noise tracking technology is the one used in Adaptive Multi-Rate (AMR) VAD2. According to the technology, a Signal to Noise Ratio (SNR) of a current frame is calculated. If the SNR is lower than a background noise threshold, the current frame is determined as a background noise frame; if the SNR is not lower than the background noise threshold, pitch and tone features of the current frame are detected. If the current frame has the pitch and tone features, a hysteresis counter is increased by 1; otherwise, spectrum fluctuations of the current frame and of several adjacent frames before the current frame are further calculated. If the spectrum fluctuation of the current frame is violent and exceeds a threshold, it is determined that the current frame may not be a noise frame, and the hysteresis counter is increased by 1; otherwise, it is determined that the current frame may be a noise frame, and a continuous noise frame counter is increased by 1. If the continuous noise frame counter reaches 50 frames, it can be determined that the current frame is a background noise frame. In addition, while the continuous noise frame counter is increasing, a small number of undetermined frames are allowed, and their number is represented by the hysteresis counter. When the continuous noise frame counter reaches 50 frames and the hysteresis counter is not greater than 6 (that is, the number of the undetermined frames is not greater than 6), the current frame is determined as a noise frame; that is, the undetermined frames do not affect the determination of the current noise frame in this case. If the hysteresis counter exceeds 6 while the continuous noise frame counter is increasing, the continuous noise frame counter is reset, and the current signal is not determined as background noise.
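The counter logic described above can be sketched as follows. This is a simplified illustration of the description only and is not the AMR VAD2 reference code. The values 50 and 6 come from the description; the two threshold values, the `Frame` fields, and the class name `NoiseTracker` are hypothetical. The sketch further assumes that a frame below the SNR threshold leaves both counters unchanged and that the hysteresis counter is cleared together with the continuous noise frame counter, neither of which is stated in the description.

```python
from dataclasses import dataclass

NOISE_RUN_LENGTH = 50        # from the description
MAX_HYSTERESIS = 6           # from the description
SNR_THRESHOLD_DB = 5.0       # hypothetical value
FLUCTUATION_THRESHOLD = 1.0  # hypothetical value


@dataclass
class Frame:
    snr_db: float                # SNR against the current noise estimate
    has_pitch_or_tone: bool      # result of pitch and tone detection
    spectral_fluctuation: float  # fluctuation over recent frames


class NoiseTracker:
    def __init__(self):
        self.noise_count = 0  # continuous noise frame counter
        self.hysteresis = 0   # number of undetermined frames

    def is_background_noise(self, frame):
        # A low SNR identifies a background noise frame directly.
        if frame.snr_db < SNR_THRESHOLD_DB:
            return True
        # Pitch/tone features or violent fluctuation: undetermined frame.
        undetermined = (frame.has_pitch_or_tone
                        or frame.spectral_fluctuation > FLUCTUATION_THRESHOLD)
        if undetermined:
            self.hysteresis += 1
            if self.hysteresis > MAX_HYSTERESIS:
                self.noise_count = 0
                self.hysteresis = 0  # assumption: cleared with the reset
            return False
        # Otherwise the frame may be noise; confirm only after 50 such frames.
        self.noise_count += 1
        return self.noise_count >= NOISE_RUN_LENGTH
```

In this sketch, a stationary signal whose SNR stays above the threshold is reported as background noise only from the 50th noise-like frame onward, and a seventh undetermined frame restarts the count from zero.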
However, the above background noise tracking technology has a drawback in tracking speed. When a sudden change happens to the background noise (a change that increases the SNR, for example, a sudden rise of the noise level), a noise signal cannot be identified by using the SNR and the background noise threshold, and the identification can only be performed after 50 continuous noise frames emerge, which results in slow tracking. If a person speaks frequently, so that the pauses between utterances are short, the requirement of 50 noise frames cannot be met, and the AMR VAD2 cannot track the background noise. Additionally, the above background noise tracking technology has a drawback in tracking accuracy. Because many music signals do not have obvious pitch and tone features, some music signals satisfy the condition that the continuous noise frame counter is greater than or equal to 50 and the hysteresis counter is not greater than 6, and are therefore mistakenly determined as background noise.